* [PATCH v5 0/9] mm/demotion: Memory tiers and demotion
@ 2022-06-03 13:42 Aneesh Kumar K.V
  2022-06-03 13:42 ` [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
                   ` (10 more replies)
  0 siblings, 11 replies; 84+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-03 13:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K.V

The current kernel has the basic memory tiering support: Inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve
performance.

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created during
the kernel initialization and updated when a NUMA node is hot-added or
hot-removed.  The current implementation puts all nodes with CPU into
the top tier, and builds the tier hierarchy tier-by-tier by establishing
the per-node demotion targets based on the distances between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases:

* The current tier initialization code always initializes
  each memory-only NUMA node into a lower tier.  But a memory-only
  NUMA node may have a high-performance memory device (e.g. a DRAM
  device attached via CXL.mem or a DRAM-backed memory-only node on
  a virtual machine) and should be put into a higher tier.

* The current tier hierarchy always puts CPU nodes into the top
  tier. But on a system with HBM (e.g. GPU memory) devices, these
  memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
  with CPUs are better placed in the next lower tier.

* Also because the current tier hierarchy always puts CPU nodes
  into the top tier, when a CPU is hot-added (or hot-removed) and
  turns a memory node from a CPU-less node into a CPU node (or vice
  versa), the memory tier hierarchy gets changed, even though no
  memory node is added or removed.  This can make the tier
  hierarchy unstable and make it difficult to support tier-based
  memory accounting.

* A higher tier node can only be demoted to selected nodes on the
  next lower tier as defined by the demotion path, not any other
  node from any lower tier.  This strict, hard-coded demotion order
  does not work in all use cases (e.g. some use cases may want to
  allow cross-socket demotion to another node in the same demotion
  tier as a fallback when the preferred demotion node is out of
  space), and has resulted in the feature request for an interface to
  override the system-wide, per-node demotion order from the
  userspace.  This demotion order is also inconsistent with the page
  allocation fallback order when all the nodes in a higher tier are
  out of space: The page allocation can fall back to any node from
  any lower tier, whereas the demotion order doesn't allow that.

* There are no interfaces for the userspace to learn about the memory
  tier hierarchy in order to optimize its memory allocations.

This patch series makes the creation of memory tiers explicit and
puts it under the control of userspace or device drivers.

Memory Tier Initialization
==========================

By default, all memory nodes are assigned to the default tier (1).
The default tier device has a rank value of 200.

A device driver can move its memory nodes up or down from the default
tier.  For example, a PMEM driver can move its memory nodes below the
default tier, whereas a GPU driver can move its memory nodes above the
default tier.

The kernel initialization code makes the decision on which exact tier
a memory node should be assigned to based on the requests from the
device drivers as well as the memory device hardware information
provided by the firmware.

Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
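
As an illustration, below is a minimal sketch of how a device driver
could request a non-default tier for its memory node.
node_set_memory_tier() is added later in this series (patch 2); the
probe function and its argument are hypothetical, and the PMEM
placement mirrors what the dax/kmem patch (patch 5) does.

#include <linux/memory-tiers.h>

/*
 * Hypothetical driver probe path: place the device's NUMA node in
 * the (lower) PMEM tier instead of the default DRAM tier.
 */
static int example_driver_probe(int numa_node)
{
	return node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
}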

Memory Allocation for Demotion
==============================
This patch series keeps the demotion target page allocation logic the
same: the demotion path allocates pages from the closest NUMA node in
the next lower tier relative to the NUMA node pages are being
reclaimed from.

This will later be improved to use the same fallback-list strategy as
the page allocator.
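
Roughly, from the reclaim side this looks like the sketch below.
next_demotion_node() is introduced later in this series;
alloc_demote_page() is a hypothetical caller name here, and error
handling is trimmed.

static struct page *alloc_demote_page(struct page *page, unsigned long private)
{
	/* Pick the closest node in the next lower memory tier. */
	int target_nid = next_demotion_node(page_to_nid(page));

	if (target_nid == NUMA_NO_NODE)
		return NULL;	/* terminal node: nothing below this tier */

	return alloc_pages_node(target_nid, GFP_HIGHUSER_MOVABLE, 0);
}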

Sysfs Interface
===============
Listing the current memory tiers and their rank details:

:/sys/devices/system/memtier$ ls
default_tier max_tier  memtier1  power  uevent
:/sys/devices/system/memtier$ cat default_tier
memtier1
:/sys/devices/system/memtier$ cat max_tier 
3
:/sys/devices/system/memtier$ 

Per node memory tier details:

For a CPU-only NUMA node:

:/sys/devices/system/node# cat node0/memtier 
:/sys/devices/system/node# echo 1 > node0/memtier 
:/sys/devices/system/node# cat node0/memtier 
:/sys/devices/system/node# 

For a NUMA node with memory:
:/sys/devices/system/node# cat node1/memtier 
1
:/sys/devices/system/node# ls ../memtier/
default_tier  max_tier  memtier1  power  uevent
:/sys/devices/system/node# echo 2 > node1/memtier 
:/sys/devices/system/node# 
:/sys/devices/system/node# ls ../memtier/
default_tier  max_tier  memtier1  memtier2  power  uevent
:/sys/devices/system/node# cat node1/memtier 
2
:/sys/devices/system/node# 
:/sys/devices/system/node# cat ../memtier/memtier2/rank 
100
:/sys/devices/system/node# 
:/sys/devices/system/node# cat ../memtier/memtier1/rank 
200
:/sys/devices/system/node#

Removing a NUMA node from demotion:
:/sys/devices/system/node# cat node1/memtier 
2
:/sys/devices/system/node# echo none > node1/memtier 
:/sys/devices/system/node# 
:/sys/devices/system/node# cat node1/memtier 
:/sys/devices/system/node# 
:/sys/devices/system/node# ls ../memtier/
default_tier  max_tier  memtier1  power  uevent
:/sys/devices/system/node# 

The above also results in the removal of memtier2, which was created in an earlier step.
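
For completeness, a small userspace sketch that dumps the rank of every
tier via the files shown above (paths as exposed by this series; higher
rank means higher tier):

#include <glob.h>
#include <stdio.h>

int main(void)
{
	glob_t g;
	size_t i;

	if (glob("/sys/devices/system/memtier/memtier*/rank", 0, NULL, &g))
		return 1;

	for (i = 0; i < g.gl_pathc; i++) {
		FILE *f = fopen(g.gl_pathv[i], "r");
		int rank;

		if (f && fscanf(f, "%d", &rank) == 1)
			printf("%s: rank %d\n", g.gl_pathv[i], rank);
		if (f)
			fclose(f);
	}
	globfree(&g);
	return 0;
}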


Changes from v4:
* Address review feedback.
* Reverse the meaning of "rank": higher rank value means higher tier.
* Add "/sys/devices/system/memtier/default_tier".
* Add node_is_toptier().

v4:
Add support for explicit memory tiers and ranks.

v3:
- Modify patch 1 subject to make it more specific
- Remove /sys/kernel/mm/numa/demotion_targets interface, use
  /sys/devices/system/node/demotion_targets instead and make
  it writable to override node_states[N_DEMOTION_TARGETS].
- Add support to view per node demotion targets via sysfs

v2:
In v1, only 1st patch of this patch series was sent, which was
implemented to avoid some of the limitations on the demotion
target sharing, however for certain numa topology, the demotion
targets found by that patch was not most optimal, so 1st patch
in this series is modified according to suggestions from Huang
and Baolin. Different examples of demotion list comparasion
between existing implementation and changed implementation can
be found in the commit message of 1st patch.

Aneesh Kumar K.V (7):
  mm/demotion: Add support for explicit memory tiers
  mm/demotion: Expose per node memory tier to sysfs
  mm/demotion: Move memory demotion related code
  mm/demotion: Build demotion targets based on explicit memory tiers
  mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  mm/demotion: Add support for removing node from demotion memory tiers
  mm/demotion: Update node_is_toptier to work with memory tiers

Jagdish Gediya (2):
  mm/demotion: Demote pages according to allocation fallback order
  mm/demotion: Add documentation for memory tiering

 Documentation/admin-guide/mm/index.rst        |   1 +
 .../admin-guide/mm/memory-tiering.rst         | 175 +++++
 drivers/base/node.c                           |  43 ++
 drivers/dax/kmem.c                            |   4 +
 include/linux/memory-tiers.h                  |  54 ++
 include/linux/migrate.h                       |  15 -
 include/linux/node.h                          |   5 -
 mm/Kconfig                                    |  11 +
 mm/Makefile                                   |   1 +
 mm/huge_memory.c                              |   1 +
 mm/memory-tiers.c                             | 706 ++++++++++++++++++
 mm/migrate.c                                  | 453 +----------
 mm/mprotect.c                                 |   1 +
 mm/vmscan.c                                   |  39 +-
 mm/vmstat.c                                   |   4 -
 15 files changed, 1017 insertions(+), 496 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/memory-tiering.rst
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

-- 
2.36.1



* [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-03 13:42 [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
@ 2022-06-03 13:42 ` Aneesh Kumar K.V
  2022-06-07 18:43   ` Tim Chen
                     ` (2 more replies)
  2022-06-03 13:42 ` [PATCH v5 2/9] mm/demotion: Expose per node memory tier to sysfs Aneesh Kumar K.V
                   ` (9 subsequent siblings)
  10 siblings, 3 replies; 84+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-03 13:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K.V

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created
during the kernel initialization and updated when a NUMA node is
hot-added or hot-removed.  The current implementation puts all
nodes with CPU into the top tier, and builds the tier hierarchy
tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases:

The current tier initialization code always initializes
each memory-only NUMA node into a lower tier.  But a memory-only
NUMA node may have a high-performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top
tier. But on a system with HBM or GPU devices, the
memory-only NUMA nodes mapping these devices should be in the
top tier, and the DRAM nodes with CPUs are better placed in the next
lower tier.

With the current kernel, a higher tier node can only be demoted to
selected nodes on the next lower tier as defined by the demotion path,
not to any other node from any lower tier.  This strict, hard-coded
demotion order does not work in all use cases (e.g. some use cases may
want to allow cross-socket demotion to another node in the same
demotion tier as a fallback when the preferred demotion node is out of
space).  This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are out
of space: the page allocation can fall back to any node from any lower
tier, whereas the demotion order doesn't allow that.

The current kernel also doesn't provide any interfaces for userspace
to learn about the memory tier hierarchy in order to optimize its
memory allocations.

This patch series addresses the above by defining memory tiers explicitly.

This patch introduces explicit memory tiers with ranks. The rank
value of a memory tier is used to derive the demotion order between
NUMA nodes. The memory tiers present in a system can be found at

/sys/devices/system/memtier/memtierN/

The nodes which are part of a specific memory tier can be listed
via
/sys/devices/system/memtier/memtierN/nodelist

"Rank" is an opaque value. Its absolute value doesn't have any
special meaning. But the rank values of different memtiers can be
compared with each other to determine the memory tier order.

For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
their rank values are 300, 100, 200, then the memory tier order is:
memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
and memtier1 is the lowest tier.

The rank value of each memtier should be unique.

A memory tier with a higher rank value appears earlier in the demotion
order than one with a lower rank value, i.e. during reclaim we prefer
to demote pages to a node in a higher-rank memory tier over a node in
a lower-rank memory tier.
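
Since the kernel keeps its internal list of memtiers sorted by rank
(see insert_memory_tier() below), finding the tier to demote into
reduces to taking the next list entry. A minimal sketch with a
hypothetical helper name; a later patch in this series uses this exact
pattern when building demotion targets:

static struct memory_tier *next_lower_memtier(struct memory_tier *memtier)
{
	/* memory_tiers is kept sorted by descending rank. */
	if (list_is_last(&memtier->list, &memory_tiers))
		return NULL;	/* lowest tier: demotion stops here */

	return list_next_entry(memtier, list);
}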

For now, we are not adding support for a dynamic number of memory
tiers, but a future series supporting that is possible. The number of
tiers supported is currently limited to MAX_MEMORY_TIERS (3). During
memory hotplug, a NUMA node that has not been added to a memory tier
gets added to DEFAULT_MEMORY_TIER (1).

This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].

[1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com

Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  20 ++++
 mm/Kconfig                   |  11 ++
 mm/Makefile                  |   1 +
 mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
 4 files changed, 220 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..e17f6b4ee177
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+#ifdef CONFIG_TIERED_MEMORY
+
+#define MEMORY_TIER_HBM_GPU	0
+#define MEMORY_TIER_DRAM	1
+#define MEMORY_TIER_PMEM	2
+
+#define MEMORY_RANK_HBM_GPU	300
+#define MEMORY_RANK_DRAM	200
+#define MEMORY_RANK_PMEM	100
+
+#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
+#define MAX_MEMORY_TIERS  3
+
+#endif	/* CONFIG_TIERED_MEMORY */
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 169e64192e48..08a3d330740b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
 config ARCH_ENABLE_THP_MIGRATION
 	bool
 
+config TIERED_MEMORY
+	bool "Support for explicit memory tiers"
+	def_bool n
+	depends on MIGRATION && NUMA
+	help
+	  Support to split nodes into memory tiers explicitly and
+	  to demote pages on reclaim to lower tiers. This option
+	  also exposes sysfs interface to read nodes available in
+	  specific tier and to move specific node among different
+	  possible tiers.
+
 config HUGETLB_PAGE_SIZE_VARIABLE
 	def_bool n
 	help
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..482557fbc9d1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..7de18d94a08d
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,188 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/device.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+	struct list_head list;
+	struct device dev;
+	nodemask_t nodelist;
+	int rank;
+};
+
+#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
+
+static struct bus_type memory_tier_subsys = {
+	.name = "memtier",
+	.dev_name = "memtier",
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+
+
+static ssize_t nodelist_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct memory_tier *memtier = to_memory_tier(dev);
+
+	return sysfs_emit(buf, "%*pbl\n",
+			  nodemask_pr_args(&memtier->nodelist));
+}
+static DEVICE_ATTR_RO(nodelist);
+
+static ssize_t rank_show(struct device *dev,
+			 struct device_attribute *attr, char *buf)
+{
+	struct memory_tier *memtier = to_memory_tier(dev);
+
+	return sysfs_emit(buf, "%d\n", memtier->rank);
+}
+static DEVICE_ATTR_RO(rank);
+
+static struct attribute *memory_tier_dev_attrs[] = {
+	&dev_attr_nodelist.attr,
+	&dev_attr_rank.attr,
+	NULL
+};
+
+static const struct attribute_group memory_tier_dev_group = {
+	.attrs = memory_tier_dev_attrs,
+};
+
+static const struct attribute_group *memory_tier_dev_groups[] = {
+	&memory_tier_dev_group,
+	NULL
+};
+
+static void memory_tier_device_release(struct device *dev)
+{
+	struct memory_tier *tier = to_memory_tier(dev);
+
+	kfree(tier);
+}
+
+/*
+ * Keep it simple by having a direct mapping between
+ * tier index and rank value.
+ */
+static inline int get_rank_from_tier(unsigned int tier)
+{
+	switch (tier) {
+	case MEMORY_TIER_HBM_GPU:
+		return MEMORY_RANK_HBM_GPU;
+	case MEMORY_TIER_DRAM:
+		return MEMORY_RANK_DRAM;
+	case MEMORY_TIER_PMEM:
+		return MEMORY_RANK_PMEM;
+	}
+
+	return 0;
+}
+
+static void insert_memory_tier(struct memory_tier *memtier)
+{
+	struct list_head *ent;
+	struct memory_tier *tmp_memtier;
+
+	list_for_each(ent, &memory_tiers) {
+		tmp_memtier = list_entry(ent, struct memory_tier, list);
+		if (tmp_memtier->rank < memtier->rank) {
+			list_add_tail(&memtier->list, ent);
+			return;
+		}
+	}
+	list_add_tail(&memtier->list, &memory_tiers);
+}
+
+static struct memory_tier *register_memory_tier(unsigned int tier)
+{
+	int error;
+	struct memory_tier *memtier;
+
+	if (tier >= MAX_MEMORY_TIERS)
+		return NULL;
+
+	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!memtier)
+		return NULL;
+
+	memtier->dev.id = tier;
+	memtier->rank = get_rank_from_tier(tier);
+	memtier->dev.bus = &memory_tier_subsys;
+	memtier->dev.release = memory_tier_device_release;
+	memtier->dev.groups = memory_tier_dev_groups;
+
+	insert_memory_tier(memtier);
+
+	error = device_register(&memtier->dev);
+	if (error) {
+		list_del(&memtier->list);
+		put_device(&memtier->dev);
+		return NULL;
+	}
+	return memtier;
+}
+
+__maybe_unused // temporary, to prevent warnings during bisects
+static void unregister_memory_tier(struct memory_tier *memtier)
+{
+	list_del(&memtier->list);
+	device_unregister(&memtier->dev);
+}
+
+static ssize_t
+max_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
+}
+static DEVICE_ATTR_RO(max_tier);
+
+static ssize_t
+default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER);
+}
+static DEVICE_ATTR_RO(default_tier);
+
+static struct attribute *memory_tier_attrs[] = {
+	&dev_attr_max_tier.attr,
+	&dev_attr_default_tier.attr,
+	NULL
+};
+
+static const struct attribute_group memory_tier_attr_group = {
+	.attrs = memory_tier_attrs,
+};
+
+static const struct attribute_group *memory_tier_attr_groups[] = {
+	&memory_tier_attr_group,
+	NULL,
+};
+
+static int __init memory_tier_init(void)
+{
+	int ret;
+	struct memory_tier *memtier;
+
+	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
+	if (ret)
+		panic("%s() failed to register subsystem: %d\n", __func__, ret);
+
+	/*
+	 * Register only the default memory tier to hide all empty
+	 * memory tiers from sysfs.
+	 */
+	memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
+	if (!memtier)
+		panic("%s() failed to register memory tier: %d\n", __func__, ret);
+
+	/* CPU only nodes are not part of memory tiers. */
+	memtier->nodelist = node_states[N_MEMORY];
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);
+
-- 
2.36.1



* [PATCH v5 2/9] mm/demotion: Expose per node memory tier to sysfs
  2022-06-03 13:42 [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
  2022-06-03 13:42 ` [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
@ 2022-06-03 13:42 ` Aneesh Kumar K.V
  2022-06-07 20:15   ` Tim Chen
  2022-06-03 13:42 ` [PATCH v5 3/9] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-03 13:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K.V

Add support to modify the memory tier for a NUMA node.

/sys/devices/system/node/nodeN/memtier

where N = node id

When read, it lists the memory tier that the node belongs to.

When written, the kernel moves the node into the specified
memory tier; the tier assignment of all other nodes is not
affected.

If the memory tier does not exist, writing to the above file
creates the tier and assigns the NUMA node to that tier.

The mutex memory_tier_lock is introduced to protect memory tier
related changes, as they can happen from sysfs as well as on
hotplug events.
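
For example, a minimal userspace sketch that moves node 1 into tier 2
through this interface (error handling trimmed):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/devices/system/node/node1/memtier", O_WRONLY);

	if (fd < 0)
		return 1;
	/* The kernel creates memtier2 if it does not exist yet. */
	if (write(fd, "2", 1) != 1) {
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}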

Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/base/node.c          |  39 +++++++++++
 include/linux/memory-tiers.h |   3 +
 mm/memory-tiers.c            | 123 ++++++++++++++++++++++++++++++++++-
 3 files changed, 164 insertions(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ac6376ef7a1..599ed64d910f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
+#include <linux/memory-tiers.h>
 
 static struct bus_type node_subsys = {
 	.name = "node",
@@ -560,11 +561,49 @@ static ssize_t node_read_distance(struct device *dev,
 }
 static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
 
+#ifdef CONFIG_TIERED_MEMORY
+static ssize_t memtier_show(struct device *dev,
+			    struct device_attribute *attr,
+			    char *buf)
+{
+	int node = dev->id;
+	int tier_index = node_get_memory_tier_id(node);
+
+	if (tier_index != -1)
+		return sysfs_emit(buf, "%d\n", tier_index);
+	return 0;
+}
+
+static ssize_t memtier_store(struct device *dev,
+			     struct device_attribute *attr,
+			     const char *buf, size_t count)
+{
+	unsigned long tier;
+	int node = dev->id;
+	int ret;
+
+	ret = kstrtoul(buf, 10, &tier);
+	if (ret)
+		return ret;
+
+	ret = node_reset_memory_tier(node, tier);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(memtier);
+#endif
+
 static struct attribute *node_dev_attrs[] = {
 	&dev_attr_meminfo.attr,
 	&dev_attr_numastat.attr,
 	&dev_attr_distance.attr,
 	&dev_attr_vmstat.attr,
+#ifdef CONFIG_TIERED_MEMORY
+	&dev_attr_memtier.attr,
+#endif
 	NULL
 };
 
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index e17f6b4ee177..91f071804476 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -15,6 +15,9 @@
 #define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
 #define MAX_MEMORY_TIERS  3
 
+int node_get_memory_tier_id(int node);
+int node_set_memory_tier(int node, int tier);
+int node_reset_memory_tier(int node, int tier);
 #endif	/* CONFIG_TIERED_MEMORY */
 
 #endif
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 7de18d94a08d..9c78c47ad030 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -126,7 +126,6 @@ static struct memory_tier *register_memory_tier(unsigned int tier)
 	return memtier;
 }
 
-__maybe_unused // temporary, to prevent warnings during bisects
 static void unregister_memory_tier(struct memory_tier *memtier)
 {
 	list_del(&memtier->list);
@@ -162,6 +161,128 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
 	NULL,
 };
 
+static struct memory_tier *__node_get_memory_tier(int node)
+{
+	struct memory_tier *memtier;
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (node_isset(node, memtier->nodelist))
+			return memtier;
+	}
+	return NULL;
+}
+
+static struct memory_tier *__get_memory_tier_from_id(int id)
+{
+	struct memory_tier *memtier;
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (memtier->dev.id == id)
+			return memtier;
+	}
+	return NULL;
+}
+
+__maybe_unused // temporary, to prevent warnings during bisects
+static void node_remove_from_memory_tier(int node)
+{
+	struct memory_tier *memtier;
+
+	mutex_lock(&memory_tier_lock);
+
+	memtier = __node_get_memory_tier(node);
+	if (!memtier)
+		goto out;
+	/*
+	 * Remove node from tier, if tier becomes
+	 * empty then unregister it to make it invisible
+	 * in sysfs.
+	 */
+	node_clear(node, memtier->nodelist);
+	if (nodes_empty(memtier->nodelist))
+		unregister_memory_tier(memtier);
+
+out:
+	mutex_unlock(&memory_tier_lock);
+}
+
+int node_get_memory_tier_id(int node)
+{
+	int tier = -1;
+	struct memory_tier *memtier;
+	/*
+	 * Make sure memory tier is not unregistered
+	 * while it is being read.
+	 */
+	mutex_lock(&memory_tier_lock);
+	memtier = __node_get_memory_tier(node);
+	if (memtier)
+		tier = memtier->dev.id;
+	mutex_unlock(&memory_tier_lock);
+
+	return tier;
+}
+
+static int __node_set_memory_tier(int node, int tier)
+{
+	int ret = 0;
+	struct memory_tier *memtier;
+
+	memtier = __get_memory_tier_from_id(tier);
+	if (!memtier) {
+		memtier = register_memory_tier(tier);
+		if (!memtier) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+	node_set(node, memtier->nodelist);
+out:
+	return ret;
+}
+
+int node_reset_memory_tier(int node, int tier)
+{
+	struct memory_tier *current_tier;
+	int ret = 0;
+
+	mutex_lock(&memory_tier_lock);
+
+	current_tier = __node_get_memory_tier(node);
+	if (!current_tier || current_tier->dev.id == tier)
+		goto out;
+
+	node_clear(node, current_tier->nodelist);
+
+	ret = __node_set_memory_tier(node, tier);
+	if (ret) {
+		/* reset it back to older tier */
+		node_set(node, current_tier->nodelist);
+		goto out;
+	}
+
+	if (nodes_empty(current_tier->nodelist))
+		unregister_memory_tier(current_tier);
+out:
+	mutex_unlock(&memory_tier_lock);
+
+	return ret;
+}
+
+int node_set_memory_tier(int node, int tier)
+{
+	struct memory_tier *memtier;
+	int ret = 0;
+
+	mutex_lock(&memory_tier_lock);
+	memtier = __node_get_memory_tier(node);
+	if (!memtier)
+		ret = __node_set_memory_tier(node, tier);
+	mutex_unlock(&memory_tier_lock);
+
+	return ret;
+}
+
 static int __init memory_tier_init(void)
 {
 	int ret;
-- 
2.36.1



* [PATCH v5 3/9] mm/demotion: Move memory demotion related code
  2022-06-03 13:42 [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
  2022-06-03 13:42 ` [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
  2022-06-03 13:42 ` [PATCH v5 2/9] mm/demotion: Expose per node memory tier to sysfs Aneesh Kumar K.V
@ 2022-06-03 13:42 ` Aneesh Kumar K.V
  2022-06-06 13:39   ` Bharata B Rao
  2022-06-03 13:42 ` [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-03 13:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K.V

This moves the memory demotion related code to mm/memory-tiers.c.
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  4 +++
 include/linux/migrate.h      |  2 --
 mm/memory-tiers.c            | 60 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 | 60 +-----------------------------------
 mm/vmscan.c                  |  1 +
 5 files changed, 66 insertions(+), 61 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 91f071804476..33ef36395a20 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -15,9 +15,13 @@
 #define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
 #define MAX_MEMORY_TIERS  3
 
+extern bool numa_demotion_enabled;
 int node_get_memory_tier_id(int node);
 int node_set_memory_tier(int node, int tier);
 int node_reset_memory_tier(int node, int tier);
+#else
+#define numa_demotion_enabled	false
+
 #endif	/* CONFIG_TIERED_MEMORY */
 
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 069a89e847f3..43e737215f33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -78,7 +78,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
 extern void set_migration_target_nodes(void);
 extern void migrate_on_reclaim_init(void);
-extern bool numa_demotion_enabled;
 extern int next_demotion_node(int node);
 #else
 static inline void set_migration_target_nodes(void) {}
@@ -87,7 +86,6 @@ static inline int next_demotion_node(int node)
 {
         return NUMA_NO_NODE;
 }
-#define numa_demotion_enabled  false
 #endif
 
 #ifdef CONFIG_COMPACTION
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 9c78c47ad030..3f382d1f844a 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -307,3 +307,63 @@ static int __init memory_tier_init(void)
 }
 subsys_initcall(memory_tier_init);
 
+bool numa_demotion_enabled = false;
+
+#ifdef CONFIG_SYSFS
+static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  numa_demotion_enabled ? "true" : "false");
+}
+
+static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = kstrtobool(buf, &numa_demotion_enabled);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute numa_demotion_enabled_attr =
+	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
+	       numa_demotion_enabled_store);
+
+static struct attribute *numa_attrs[] = {
+	&numa_demotion_enabled_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group numa_attr_group = {
+	.attrs = numa_attrs,
+};
+
+static int __init numa_init_sysfs(void)
+{
+	int err;
+	struct kobject *numa_kobj;
+
+	numa_kobj = kobject_create_and_add("numa", mm_kobj);
+	if (!numa_kobj) {
+		pr_err("failed to create numa kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(numa_kobj, &numa_attr_group);
+	if (err) {
+		pr_err("failed to register numa group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(numa_kobj);
+	return err;
+}
+subsys_initcall(numa_init_sysfs);
+#endif /* CONFIG_SYSFS */
+
diff --git a/mm/migrate.c b/mm/migrate.c
index e51588e95f57..29cacc217e38 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2508,64 +2508,6 @@ void __init migrate_on_reclaim_init(void)
 	set_migration_target_nodes();
 	cpus_read_unlock();
 }
+#endif /* CONFIG_NUMA */
 
-bool numa_demotion_enabled = false;
-
-#ifdef CONFIG_SYSFS
-static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
-					  struct kobj_attribute *attr, char *buf)
-{
-	return sysfs_emit(buf, "%s\n",
-			  numa_demotion_enabled ? "true" : "false");
-}
-
-static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
-					   struct kobj_attribute *attr,
-					   const char *buf, size_t count)
-{
-	ssize_t ret;
-
-	ret = kstrtobool(buf, &numa_demotion_enabled);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static struct kobj_attribute numa_demotion_enabled_attr =
-	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
-	       numa_demotion_enabled_store);
-
-static struct attribute *numa_attrs[] = {
-	&numa_demotion_enabled_attr.attr,
-	NULL,
-};
-
-static const struct attribute_group numa_attr_group = {
-	.attrs = numa_attrs,
-};
-
-static int __init numa_init_sysfs(void)
-{
-	int err;
-	struct kobject *numa_kobj;
 
-	numa_kobj = kobject_create_and_add("numa", mm_kobj);
-	if (!numa_kobj) {
-		pr_err("failed to create numa kobject\n");
-		return -ENOMEM;
-	}
-	err = sysfs_create_group(numa_kobj, &numa_attr_group);
-	if (err) {
-		pr_err("failed to register numa group\n");
-		goto delete_obj;
-	}
-	return 0;
-
-delete_obj:
-	kobject_put(numa_kobj);
-	return err;
-}
-subsys_initcall(numa_init_sysfs);
-#endif /* CONFIG_SYSFS */
-#endif /* CONFIG_NUMA */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7d9a683e3a7..3a8f78277f99 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,7 @@
 #include <linux/printk.h>
 #include <linux/dax.h>
 #include <linux/psi.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
-- 
2.36.1



* [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-06-03 13:42 [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2022-06-03 13:42 ` [PATCH v5 3/9] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
@ 2022-06-03 13:42 ` Aneesh Kumar K.V
  2022-06-07 22:51   ` Tim Chen
                     ` (2 more replies)
  2022-06-03 13:42 ` [PATCH v5 5/9] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
                   ` (6 subsequent siblings)
  10 siblings, 3 replies; 84+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-03 13:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K.V

This patch switches the demotion target building logic to use memory tiers
instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
default tier (1), and additional memory tiers will be added by drivers like
dax kmem.

This patch builds the demotion target for a NUMA node by looking at all
memory tiers below the tier to which the NUMA node belongs. The closest node
in the immediately following memory tier is used as a demotion target.
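
Schematically, the selection implemented by establish_migration_targets()
in this patch can be summarized as below (a sketch with a hypothetical
helper name; find_next_best_node() is the helper the real code relies on):

static void pick_preferred_targets(int node, struct memory_tier *next_tier,
				   nodemask_t *preferred)
{
	int target, best_distance = -1;
	nodemask_t used;

	/*
	 * find_next_best_node() treats 'used' as a skip list, so
	 * pre-populate it with everything outside the next lower tier.
	 */
	nodes_andnot(used, node_states[N_MEMORY], next_tier->nodelist);

	/* Keep every node that ties for the smallest NUMA distance. */
	while ((target = find_next_best_node(node, &used)) != NUMA_NO_NODE) {
		int distance = node_distance(node, target);

		if (best_distance == -1 || distance == best_distance) {
			best_distance = distance;
			node_set(target, *preferred);
		} else {
			break;
		}
	}
}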

Since we now build demotion targets only for N_MEMORY NUMA nodes, the
CPU hotplug callbacks are removed in this patch.

The rank approach allows us to keep memory tier device IDs stable even if there
is a need to change the tier ordering among different memory tiers, e.g. DRAM
nodes with CPUs will always be on memtier1, no matter how many tiers are higher
or lower than these nodes. A new memory tier can be inserted into the tier
hierarchy for a new set of nodes without affecting the node assignment of any
existing memtier, provided that there is enough gap in the rank values for the
new memtier. For example, with ranks 300, 200 and 100, a new tier could be
slotted between memtier1 and memtier2 by giving it any rank value between 100
and 200.

The absolute value of the "rank" of a memtier doesn't necessarily carry any
meaning. Its value relative to other memtiers decides the level of this
memtier in the tier hierarchy.

For now, this patch supports hardcoded rank values, which are 300, 200 and 100
for memory tiers 0, 1 and 2 respectively.

Below is the sysfs interface to read the rank value of a memory tier:
/sys/devices/system/memtier/memtierN/rank

This interface is read-only for now. Write support can be added when there is
a need for more memory tiers (> 3) with flexible ordering requirements among
them.

Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |   5 +
 include/linux/migrate.h      |  13 --
 mm/memory-tiers.c            | 269 ++++++++++++++++++++++++
 mm/migrate.c                 | 394 -----------------------------------
 mm/vmstat.c                  |   4 -
 5 files changed, 274 insertions(+), 411 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 33ef36395a20..adc2cb3bf5f8 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -16,11 +16,16 @@
 #define MAX_MEMORY_TIERS  3
 
 extern bool numa_demotion_enabled;
+int next_demotion_node(int node);
 int node_get_memory_tier_id(int node);
 int node_set_memory_tier(int node, int tier);
 int node_reset_memory_tier(int node, int tier);
 #else
 #define numa_demotion_enabled	false
+static inline int next_demotion_node(int node)
+{
+	return NUMA_NO_NODE;
+}
 
 #endif	/* CONFIG_TIERED_MEMORY */
 
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 43e737215f33..93fab62e6548 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 
 #endif /* CONFIG_MIGRATION */
 
-#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
-extern void set_migration_target_nodes(void);
-extern void migrate_on_reclaim_init(void);
-extern int next_demotion_node(int node);
-#else
-static inline void set_migration_target_nodes(void) {}
-static inline void migrate_on_reclaim_init(void) {}
-static inline int next_demotion_node(int node)
-{
-        return NUMA_NO_NODE;
-}
-#endif
-
 #ifdef CONFIG_COMPACTION
 extern int PageMovable(struct page *page);
 extern void __SetPageMovable(struct page *page, struct address_space *mapping);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 3f382d1f844a..0d05c0bfb79b 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -4,6 +4,10 @@
 #include <linux/nodemask.h>
 #include <linux/slab.h>
 #include <linux/memory-tiers.h>
+#include <linux/random.h>
+#include <linux/memory.h>
+
+#include "internal.h"
 
 struct memory_tier {
 	struct list_head list;
@@ -12,6 +16,10 @@ struct memory_tier {
 	int rank;
 };
 
+struct demotion_nodes {
+	nodemask_t preferred;
+};
+
 #define to_memory_tier(device) container_of(device, struct memory_tier, dev)
 
 static struct bus_type memory_tier_subsys = {
@@ -19,9 +27,71 @@ static struct bus_type memory_tier_subsys = {
 	.dev_name = "memtier",
 };
 
+static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
 
+/*
+ * node_demotion[] examples:
+ *
+ * Example 1:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
+ *
+ * node distances:
+ * node   0    1    2    3
+ *    0  10   20   30   40
+ *    1  20   10   40   30
+ *    2  30   40   10   40
+ *    3  40   30   40   10
+ *
+ * memory_tiers[0] = <empty>
+ * memory_tiers[1] = 0-1
+ * memory_tiers[2] = 2-3
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 3
+ * node_demotion[2].preferred = <empty>
+ * node_demotion[3].preferred = <empty>
+ *
+ * Example 2:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   30
+ *    2  30   30   10
+ *
+ * memory_tiers[0] = <empty>
+ * memory_tiers[1] = 0-2
+ * memory_tiers[2] = <empty>
+ *
+ * node_demotion[0].preferred = <empty>
+ * node_demotion[1].preferred = <empty>
+ * node_demotion[2].preferred = <empty>
+ *
+ * Example 3:
+ *
+ * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   40
+ *    2  30   40   10
+ *
+ * memory_tiers[0] = 1
+ * memory_tiers[1] = 0
+ * memory_tiers[2] = 2
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 0
+ * node_demotion[2].preferred = <empty>
+ *
+ */
+static struct demotion_nodes *node_demotion __read_mostly;
 
 static ssize_t nodelist_show(struct device *dev,
 			     struct device_attribute *attr, char *buf)
@@ -202,6 +272,7 @@ static void node_remove_from_memory_tier(int node)
 	if (nodes_empty(memtier->nodelist))
 		unregister_memory_tier(memtier);
 
+	establish_migration_targets();
 out:
 	mutex_unlock(&memory_tier_lock);
 }
@@ -263,6 +334,8 @@ int node_reset_memory_tier(int node, int tier)
 
 	if (nodes_empty(current_tier->nodelist))
 		unregister_memory_tier(current_tier);
+
+	establish_migration_targets();
 out:
 	mutex_unlock(&memory_tier_lock);
 
@@ -276,13 +349,210 @@ int node_set_memory_tier(int node, int tier)
 
 	mutex_lock(&memory_tier_lock);
 	memtier = __node_get_memory_tier(node);
+	/*
+	 * If the node is already part of a tier, proceed with the
+	 * current tier value, because we might want to establish
+	 * new migration paths now. The node might have been added to a tier
+	 * before it was made part of N_MEMORY, hence establish_migration_targets()
+	 * will have skipped this node.
+	 */
 	if (!memtier)
 		ret = __node_set_memory_tier(node, tier);
+	establish_migration_targets();
+
 	mutex_unlock(&memory_tier_lock);
 
 	return ret;
 }
 
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * Return: node id for next memory node in the demotion path hierarchy
+ * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
+ * @node online or guarantee that it *continues* to be the next demotion
+ * target.
+ */
+int next_demotion_node(int node)
+{
+	struct demotion_nodes *nd;
+	int target, nnodes, i;
+
+	if (!node_demotion)
+		return NUMA_NO_NODE;
+
+	nd = &node_demotion[node];
+
+	/*
+	 * node_demotion[] is updated without excluding this
+	 * function from running.
+	 *
+	 * Make sure to use RCU over entire code blocks if
+	 * node_demotion[] reads need to be consistent.
+	 */
+	rcu_read_lock();
+
+	nnodes = nodes_weight(nd->preferred);
+	if (!nnodes) {
+		rcu_read_unlock();
+		return NUMA_NO_NODE;
+	}
+
+	/*
+	 * If there are multiple target nodes, just select one
+	 * target node randomly.
+	 *
+	 * In addition, we can also use round-robin to select
+	 * target node, but we should introduce another variable
+	 * for node_demotion[] to record last selected target node,
+	 * that may cause cache ping-pong due to the changing of
+	 * last target node. Or introducing per-cpu data to avoid
+	 * caching issue, which seems more complicated. So selecting
+	 * target node randomly seems better until now.
+	 */
+	nnodes = get_random_int() % nnodes;
+	target = first_node(nd->preferred);
+	for (i = 0; i < nnodes; i++)
+		target = next_node(target, nd->preferred);
+
+	rcu_read_unlock();
+
+	return target;
+}
+
+/* Disable reclaim-based migration. */
+static void __disable_all_migrate_targets(void)
+{
+	int node;
+
+	for_each_node_mask(node, node_states[N_MEMORY])
+		node_demotion[node].preferred = NODE_MASK_NONE;
+}
+
+static void disable_all_migrate_targets(void)
+{
+	__disable_all_migrate_targets();
+
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 */
+	synchronize_rcu();
+}
+
+/*
+ * Find an automatic demotion target for all memory
+ * nodes. Failing here is OK.  It might just indicate
+ * being at the end of a chain.
+ */
+static void establish_migration_targets(void)
+{
+	struct memory_tier *memtier;
+	struct demotion_nodes *nd;
+	int target = NUMA_NO_NODE, node;
+	int distance, best_distance;
+	nodemask_t used;
+
+	if (!node_demotion)
+		return;
+
+	disable_all_migrate_targets();
+
+	for_each_node_mask(node, node_states[N_MEMORY]) {
+		best_distance = -1;
+		nd = &node_demotion[node];
+
+		memtier = __node_get_memory_tier(node);
+		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
+			continue;
+		/*
+		 * Get the next memtier to find the demotion node list.
+		 */
+		memtier = list_next_entry(memtier, list);
+
+		/*
+		 * find_next_best_node, use 'used' nodemask as a skip list.
+		 * Add all memory nodes except the selected memory tier
+		 * nodelist to skip list so that we find the best node from the
+		 * memtier nodelist.
+		 */
+		nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);
+
+		/*
+		 * Find all the nodes in the memory tier node list of same best distance.
+		 * add them to the preferred mask. We randomly select between nodes
+		 * in the preferred mask when allocating pages during demotion.
+		 */
+		do {
+			target = find_next_best_node(node, &used);
+			if (target == NUMA_NO_NODE)
+				break;
+
+			distance = node_distance(node, target);
+			if (distance == best_distance || best_distance == -1) {
+				best_distance = distance;
+				node_set(target, nd->preferred);
+			} else {
+				break;
+			}
+		} while (1);
+	}
+}
+
+/*
+ * This runs whether reclaim-based migration is enabled or not,
+ * which ensures that the user can turn reclaim-based migration
+ * at any time without needing to recalculate migration targets.
+ */
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+						 unsigned long action, void *_arg)
+{
+	struct memory_notify *arg = _arg;
+
+	/*
+	 * Only update the node migration order when a node is
+	 * changing status, like online->offline.
+	 */
+	if (arg->status_change_nid < 0)
+		return notifier_from_errno(0);
+
+	switch (action) {
+	case MEM_OFFLINE:
+		/*
+		 * In case we are moving out of N_MEMORY. Keep the node
+		 * in the memory tier so that when we bring memory online,
+		 * they appear in the right memory tier. We still need
+		 * to rebuild the demotion order.
+		 */
+		mutex_lock(&memory_tier_lock);
+		establish_migration_targets();
+		mutex_unlock(&memory_tier_lock);
+		break;
+	case MEM_ONLINE:
+		/*
+		 * We ignore the error here, if the node already have the tier
+		 * registered, we will continue to use that for the new memory
+		 * we are adding here.
+		 */
+		node_set_memory_tier(arg->status_change_nid, DEFAULT_MEMORY_TIER);
+		break;
+	}
+
+	return notifier_from_errno(0);
+}
+
+static void __init migrate_on_reclaim_init(void)
+{
+	node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
+				GFP_KERNEL);
+	WARN_ON(!node_demotion);
+
+	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
+}
+
 static int __init memory_tier_init(void)
 {
 	int ret;
@@ -302,6 +570,7 @@ static int __init memory_tier_init(void)
 
 	/* CPU only nodes are not part of memory tiers. */
 	memtier->nodelist = node_states[N_MEMORY];
+	migrate_on_reclaim_init();
 
 	return 0;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index 29cacc217e38..0b554625a219 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2116,398 +2116,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	return 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
-
-/*
- * node_demotion[] example:
- *
- * Consider a system with two sockets.  Each socket has
- * three classes of memory attached: fast, medium and slow.
- * Each memory class is placed in its own NUMA node.  The
- * CPUs are placed in the node with the "fast" memory.  The
- * 6 NUMA nodes (0-5) might be split among the sockets like
- * this:
- *
- *	Socket A: 0, 1, 2
- *	Socket B: 3, 4, 5
- *
- * When Node 0 fills up, its memory should be migrated to
- * Node 1.  When Node 1 fills up, it should be migrated to
- * Node 2.  The migration path start on the nodes with the
- * processors (since allocations default to this node) and
- * fast memory, progress through medium and end with the
- * slow memory:
- *
- *	0 -> 1 -> 2 -> stop
- *	3 -> 4 -> 5 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *
- *	{  nr=1, nodes[0]=1 }, // Node 0 migrates to 1
- *	{  nr=1, nodes[0]=2 }, // Node 1 migrates to 2
- *	{  nr=0, nodes[0]=-1 }, // Node 2 does not migrate
- *	{  nr=1, nodes[0]=4 }, // Node 3 migrates to 4
- *	{  nr=1, nodes[0]=5 }, // Node 4 migrates to 5
- *	{  nr=0, nodes[0]=-1 }, // Node 5 does not migrate
- *
- * Moreover some systems may have multiple slow memory nodes.
- * Suppose a system has one socket with 3 memory nodes, node 0
- * is fast memory type, and node 1/2 both are slow memory
- * type, and the distance between fast memory node and slow
- * memory node is same. So the migration path should be:
- *
- *	0 -> 1/2 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
- *	{ nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
- *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
- */
-
-/*
- * Writes to this array occur without locking.  Cycles are
- * not allowed: Node X demotes to Y which demotes to X...
- *
- * If multiple reads are performed, a single rcu_read_lock()
- * must be held over all reads to ensure that no cycles are
- * observed.
- */
-#define DEFAULT_DEMOTION_TARGET_NODES 15
-
-#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
-#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
-#else
-#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
-#endif
-
-struct demotion_nodes {
-	unsigned short nr;
-	short nodes[DEMOTION_TARGET_NODES];
-};
-
-static struct demotion_nodes *node_demotion __read_mostly;
-
-/**
- * next_demotion_node() - Get the next node in the demotion path
- * @node: The starting node to lookup the next node
- *
- * Return: node id for next memory node in the demotion path hierarchy
- * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
- * @node online or guarantee that it *continues* to be the next demotion
- * target.
- */
-int next_demotion_node(int node)
-{
-	struct demotion_nodes *nd;
-	unsigned short target_nr, index;
-	int target;
-
-	if (!node_demotion)
-		return NUMA_NO_NODE;
-
-	nd = &node_demotion[node];
-
-	/*
-	 * node_demotion[] is updated without excluding this
-	 * function from running.  RCU doesn't provide any
-	 * compiler barriers, so the READ_ONCE() is required
-	 * to avoid compiler reordering or read merging.
-	 *
-	 * Make sure to use RCU over entire code blocks if
-	 * node_demotion[] reads need to be consistent.
-	 */
-	rcu_read_lock();
-	target_nr = READ_ONCE(nd->nr);
-
-	switch (target_nr) {
-	case 0:
-		target = NUMA_NO_NODE;
-		goto out;
-	case 1:
-		index = 0;
-		break;
-	default:
-		/*
-		 * If there are multiple target nodes, just select one
-		 * target node randomly.
-		 *
-		 * In addition, we can also use round-robin to select
-		 * target node, but we should introduce another variable
-		 * for node_demotion[] to record last selected target node,
-		 * that may cause cache ping-pong due to the changing of
-		 * last target node. Or introducing per-cpu data to avoid
-		 * caching issue, which seems more complicated. So selecting
-		 * target node randomly seems better until now.
-		 */
-		index = get_random_int() % target_nr;
-		break;
-	}
-
-	target = READ_ONCE(nd->nodes[index]);
-
-out:
-	rcu_read_unlock();
-	return target;
-}
-
-/* Disable reclaim-based migration. */
-static void __disable_all_migrate_targets(void)
-{
-	int node, i;
-
-	if (!node_demotion)
-		return;
-
-	for_each_online_node(node) {
-		node_demotion[node].nr = 0;
-		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
-			node_demotion[node].nodes[i] = NUMA_NO_NODE;
-	}
-}
-
-static void disable_all_migrate_targets(void)
-{
-	__disable_all_migrate_targets();
-
-	/*
-	 * Ensure that the "disable" is visible across the system.
-	 * Readers will see either a combination of before+disable
-	 * state or disable+after.  They will never see before and
-	 * after state together.
-	 *
-	 * The before+after state together might have cycles and
-	 * could cause readers to do things like loop until this
-	 * function finishes.  This ensures they can only see a
-	 * single "bad" read and would, for instance, only loop
-	 * once.
-	 */
-	synchronize_rcu();
-}
-
-/*
- * Find an automatic demotion target for 'node'.
- * Failing here is OK.  It might just indicate
- * being at the end of a chain.
- */
-static int establish_migrate_target(int node, nodemask_t *used,
-				    int best_distance)
-{
-	int migration_target, index, val;
-	struct demotion_nodes *nd;
-
-	if (!node_demotion)
-		return NUMA_NO_NODE;
-
-	nd = &node_demotion[node];
-
-	migration_target = find_next_best_node(node, used);
-	if (migration_target == NUMA_NO_NODE)
-		return NUMA_NO_NODE;
-
-	/*
-	 * If the node has been set a migration target node before,
-	 * which means it's the best distance between them. Still
-	 * check if this node can be demoted to other target nodes
-	 * if they have a same best distance.
-	 */
-	if (best_distance != -1) {
-		val = node_distance(node, migration_target);
-		if (val > best_distance)
-			goto out_clear;
-	}
-
-	index = nd->nr;
-	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
-		      "Exceeds maximum demotion target nodes\n"))
-		goto out_clear;
-
-	nd->nodes[index] = migration_target;
-	nd->nr++;
-
-	return migration_target;
-out_clear:
-	node_clear(migration_target, *used);
-	return NUMA_NO_NODE;
-}
-
-/*
- * When memory fills up on a node, memory contents can be
- * automatically migrated to another node instead of
- * discarded at reclaim.
- *
- * Establish a "migration path" which will start at nodes
- * with CPUs and will follow the priorities used to build the
- * page allocator zonelists.
- *
- * The difference here is that cycles must be avoided.  If
- * node0 migrates to node1, then neither node1, nor anything
- * node1 migrates to can migrate to node0. Also one node can
- * be migrated to multiple nodes if the target nodes all have
- * a same best-distance against the source node.
- *
- * This function can run simultaneously with readers of
- * node_demotion[].  However, it can not run simultaneously
- * with itself.  Exclusion is provided by memory hotplug events
- * being single-threaded.
- */
-static void __set_migration_target_nodes(void)
-{
-	nodemask_t next_pass;
-	nodemask_t this_pass;
-	nodemask_t used_targets = NODE_MASK_NONE;
-	int node, best_distance;
-
-	/*
-	 * Avoid any oddities like cycles that could occur
-	 * from changes in the topology.  This will leave
-	 * a momentary gap when migration is disabled.
-	 */
-	disable_all_migrate_targets();
-
-	/*
-	 * Allocations go close to CPUs, first.  Assume that
-	 * the migration path starts at the nodes with CPUs.
-	 */
-	next_pass = node_states[N_CPU];
-again:
-	this_pass = next_pass;
-	next_pass = NODE_MASK_NONE;
-	/*
-	 * To avoid cycles in the migration "graph", ensure
-	 * that migration sources are not future targets by
-	 * setting them in 'used_targets'.  Do this only
-	 * once per pass so that multiple source nodes can
-	 * share a target node.
-	 *
-	 * 'used_targets' will become unavailable in future
-	 * passes.  This limits some opportunities for
-	 * multiple source nodes to share a destination.
-	 */
-	nodes_or(used_targets, used_targets, this_pass);
-
-	for_each_node_mask(node, this_pass) {
-		best_distance = -1;
-
-		/*
-		 * Try to set up the migration path for the node, and the target
-		 * migration nodes can be multiple, so doing a loop to find all
-		 * the target nodes if they all have a best node distance.
-		 */
-		do {
-			int target_node =
-				establish_migrate_target(node, &used_targets,
-							 best_distance);
-
-			if (target_node == NUMA_NO_NODE)
-				break;
-
-			if (best_distance == -1)
-				best_distance = node_distance(node, target_node);
-
-			/*
-			 * Visit targets from this pass in the next pass.
-			 * Eventually, every node will have been part of
-			 * a pass, and will become set in 'used_targets'.
-			 */
-			node_set(target_node, next_pass);
-		} while (1);
-	}
-	/*
-	 * 'next_pass' contains nodes which became migration
-	 * targets in this pass.  Make additional passes until
-	 * no more migrations targets are available.
-	 */
-	if (!nodes_empty(next_pass))
-		goto again;
-}
-
-/*
- * For callers that do not hold get_online_mems() already.
- */
-void set_migration_target_nodes(void)
-{
-	get_online_mems();
-	__set_migration_target_nodes();
-	put_online_mems();
-}
-
-/*
- * This leaves migrate-on-reclaim transiently disabled between
- * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
- * whether reclaim-based migration is enabled or not, which
- * ensures that the user can turn reclaim-based migration at
- * any time without needing to recalculate migration targets.
- *
- * These callbacks already hold get_online_mems().  That is why
- * __set_migration_target_nodes() can be used as opposed to
- * set_migration_target_nodes().
- */
-#ifdef CONFIG_MEMORY_HOTPLUG
-static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
-						 unsigned long action, void *_arg)
-{
-	struct memory_notify *arg = _arg;
-
-	/*
-	 * Only update the node migration order when a node is
-	 * changing status, like online->offline.  This avoids
-	 * the overhead of synchronize_rcu() in most cases.
-	 */
-	if (arg->status_change_nid < 0)
-		return notifier_from_errno(0);
-
-	switch (action) {
-	case MEM_GOING_OFFLINE:
-		/*
-		 * Make sure there are not transient states where
-		 * an offline node is a migration target.  This
-		 * will leave migration disabled until the offline
-		 * completes and the MEM_OFFLINE case below runs.
-		 */
-		disable_all_migrate_targets();
-		break;
-	case MEM_OFFLINE:
-	case MEM_ONLINE:
-		/*
-		 * Recalculate the target nodes once the node
-		 * reaches its final state (online or offline).
-		 */
-		__set_migration_target_nodes();
-		break;
-	case MEM_CANCEL_OFFLINE:
-		/*
-		 * MEM_GOING_OFFLINE disabled all the migration
-		 * targets.  Reenable them.
-		 */
-		__set_migration_target_nodes();
-		break;
-	case MEM_GOING_ONLINE:
-	case MEM_CANCEL_ONLINE:
-		break;
-	}
-
-	return notifier_from_errno(0);
-}
-#endif
-
-void __init migrate_on_reclaim_init(void)
-{
-	node_demotion = kcalloc(nr_node_ids,
-				sizeof(struct demotion_nodes),
-				GFP_KERNEL);
-	WARN_ON(!node_demotion);
-#ifdef CONFIG_MEMORY_HOTPLUG
-	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
-#endif
-	/*
-	 * At this point, all numa nodes with memory/CPus have their state
-	 * properly set, so we can build the demotion order now.
-	 * Let us hold the cpu_hotplug lock just, as we could possibily have
-	 * CPU hotplug events during boot.
-	 */
-	cpus_read_lock();
-	set_migration_target_nodes();
-	cpus_read_unlock();
-}
 #endif /* CONFIG_NUMA */
-
-
diff --git a/mm/vmstat.c b/mm/vmstat.c
index da525bfb6f4a..835e3c028f35 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -28,7 +28,6 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
-#include <linux/migrate.h>
 
 #include "internal.h"
 
@@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu)
 
 	if (!node_state(cpu_to_node(cpu), N_CPU)) {
 		node_set_state(cpu_to_node(cpu), N_CPU);
-		set_migration_target_nodes();
 	}
 
 	return 0;
@@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu)
 		return 0;
 
 	node_clear_state(node, N_CPU);
-	set_migration_target_nodes();
 
 	return 0;
 }
@@ -2118,7 +2115,6 @@ void __init init_mm_internals(void)
 
 	start_shepherd_timer();
 #endif
-	migrate_on_reclaim_init();
 #ifdef CONFIG_PROC_FS
 	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
 	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v5 5/9] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  2022-06-03 13:42 [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2022-06-03 13:42 ` [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
@ 2022-06-03 13:42 ` Aneesh Kumar K.V
  2022-06-03 13:42 ` [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 84+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-03 13:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K.V

By default, all nodes are assigned to DEFAULT_MEMORY_TIER, which
is memory tier 1, the tier designated for nodes with DRAM, so it
is not the right tier for dax devices.

Set the dax kmem device node's tier to MEMORY_TIER_PMEM. In the future,
support should be added to distinguish the dax devices which should
not be in MEMORY_TIER_PMEM, and the right memory tier should be set for them.
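
A minimal sketch (not part of this patch) of how another memory device
driver could use the node_set_memory_tier() API exported below; the
probe function and its numa_node argument are hypothetical:

#include <linux/memory-tiers.h>

/* Hypothetical driver probe path: place the device's NUMA node into
 * the PMEM tier before onlining its memory, so that demotion targets
 * are rebuilt with the node in the right tier.
 */
static int my_memdev_probe(int numa_node)
{
#ifdef CONFIG_TIERED_MEMORY
	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
#endif
	return 0;
}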

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/dax/kmem.c | 4 ++++
 mm/memory-tiers.c  | 1 +
 2 files changed, 5 insertions(+)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..7a11c387fbbc 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -11,6 +11,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/memory-tiers.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 
 	dev_set_drvdata(dev, data);
 
+#ifdef CONFIG_TIERED_MEMORY
+	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+#endif
 	return 0;
 
 err_request_mem:
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0d05c0bfb79b..9c82cf4c4bca 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -364,6 +364,7 @@ int node_set_memory_tier(int node, int tier)
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(node_set_memory_tier);
 
 /**
  * next_demotion_node() - Get the next node in the demotion path
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers
  2022-06-03 13:42 [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2022-06-03 13:42 ` [PATCH v5 5/9] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
@ 2022-06-03 13:42 ` Aneesh Kumar K.V
  2022-06-07 23:40   ` Tim Chen
  2022-06-08  6:59   ` Ying Huang
  2022-06-03 13:42 ` [PATCH v5 7/9] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 84+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-03 13:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K.V

This patch adds the special string "none" as a supported memtier value
that we can use to remove a specific node from being used as a demotion
target.

For example:
:/sys/devices/system/node/node1# cat memtier
1
:/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
1-3
:/sys/devices/system/node/node1# echo none > memtier
:/sys/devices/system/node/node1#
:/sys/devices/system/node/node1# cat memtier
:/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
2-3
:/sys/devices/system/node/node1#

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/base/node.c          |  4 ++++
 include/linux/memory-tiers.h |  1 +
 mm/memory-tiers.c            | 18 +++++++++++++++---
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 599ed64d910f..344786290149 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -582,6 +582,10 @@ static ssize_t memtier_store(struct device *dev,
 	int node = dev->id;
 	int ret;
 
+	if (!strncmp(buf, "none", strlen("none"))) {
+		node_remove_from_memory_tier(node);
+		return count;
+	}
 	ret = kstrtoul(buf, 10, &tier);
 	if (ret)
 		return ret;
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index adc2cb3bf5f8..79bd8d26feb2 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -17,6 +17,7 @@
 
 extern bool numa_demotion_enabled;
 int next_demotion_node(int node);
+void node_remove_from_memory_tier(int node);
 int node_get_memory_tier_id(int node);
 int node_set_memory_tier(int node, int tier);
 int node_reset_memory_tier(int node, int tier);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 9c82cf4c4bca..b4e72b672d4d 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -253,8 +253,7 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
 	return NULL;
 }
 
-__maybe_unused // temporay to prevent warnings during bisects
-static void node_remove_from_memory_tier(int node)
+void node_remove_from_memory_tier(int node)
 {
 	struct memory_tier *memtier;
 
@@ -320,7 +319,20 @@ int node_reset_memory_tier(int node, int tier)
 	mutex_lock(&memory_tier_lock);
 
 	current_tier = __node_get_memory_tier(node);
-	if (!current_tier || current_tier->dev.id == tier)
+	if (!current_tier) {
+		/*
+		 * If an N_MEMORY node doesn't have a memory tier, then we
+		 * removed it from demotion earlier and we are now trying to
+		 * add it back. Just add the node to the requested tier.
+		 */
+		if (node_state(node, N_MEMORY)) {
+			ret = __node_set_memory_tier(node, tier);
+			establish_migration_targets();
+		}
+		goto out;
+	}
+
+	if (current_tier->dev.id == tier)
 		goto out;
 
 	node_clear(node, current_tier->nodelist);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v5 7/9] mm/demotion: Demote pages according to allocation fallback order
  2022-06-03 13:42 [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2022-06-03 13:42 ` [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers Aneesh Kumar K.V
@ 2022-06-03 13:42 ` Aneesh Kumar K.V
  2022-06-03 13:42 ` [PATCH v5 8/9] mm/demotion: Add documentation for memory tiering Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 84+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-03 13:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K . V

From: Jagdish Gediya <jvgediya@linux.ibm.com>

Currently, a higher tier node can only be demoted to selected
nodes on the next lower tier as defined by the demotion path,
not any other node from any lower tier.  This strict, hard-coded
demotion order does not work in all use cases (e.g. some use cases
may want to allow cross-socket demotion to another node in the same
demotion tier as a fallback when the preferred demotion node is out
of space). This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are
out of space: The page allocation can fall back to any node from any
lower tier, whereas the demotion order doesn't allow that currently.

This patch adds support to get the mask of all allowed demotion targets
for a node. The demote_page_list() function is also modified to use this
allowed node mask by filling it in the migration_target_control structure
before passing it to migrate_pages().
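
As an illustration (a userspace toy model, not from the patch, and
ignoring the N_MEMORY intersection), the allowed-mask construction can
be traced for a hypothetical two-tier layout with DRAM nodes 0-1 above
PMEM nodes 2-3, using plain bitmasks in place of nodemask_t:

#include <stdio.h>

int main(void)
{
	/* Tiers ordered highest first, as on the memory_tiers list. */
	unsigned int tier_nodes[2] = { 0x3 /* DRAM nodes 0-1 */,
				       0xc /* PMEM nodes 2-3 */ };
	unsigned int allowed = 0;
	int i, node;

	/* Union of all tiers (stand-in for the nodes_or() loop). */
	for (i = 0; i < 2; i++)
		allowed |= tier_nodes[i];

	/* Walk tiers top down, dropping each tier from 'allowed' so a
	 * node may only demote to tiers strictly below its own.
	 */
	for (i = 0; i < 2; i++) {
		allowed &= ~tier_nodes[i];
		for (node = 0; node < 4; node++)
			if (tier_nodes[i] & (1u << node))
				printf("node %d allowed: 0x%x\n",
				       node, allowed);
	}
	return 0;	/* prints 0xc for nodes 0-1 and 0x0 for nodes 2-3 */
}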

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  5 ++++
 mm/memory-tiers.c            | 49 ++++++++++++++++++++++++++++++++++--
 mm/vmscan.c                  | 38 +++++++++++++---------------
 3 files changed, 70 insertions(+), 22 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 79bd8d26feb2..cd6e71f702ad 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -21,6 +21,7 @@ void node_remove_from_memory_tier(int node);
 int node_get_memory_tier_id(int node);
 int node_set_memory_tier(int node, int tier);
 int node_reset_memory_tier(int node, int tier);
+void node_get_allowed_targets(int node, nodemask_t *targets);
 #else
 #define numa_demotion_enabled	false
 static inline int next_demotion_node(int node)
@@ -28,6 +29,10 @@ static inline int next_demotion_node(int node)
 	return NUMA_NO_NODE;
 }
 
+static inline void node_get_allowed_targets(int node, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif	/* CONFIG_TIERED_MEMORY */
 
 #endif
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index b4e72b672d4d..592d939ec28d 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -18,6 +18,7 @@ struct memory_tier {
 
 struct demotion_nodes {
 	nodemask_t preferred;
+	nodemask_t allowed;
 };
 
 #define to_memory_tier(device) container_of(device, struct memory_tier, dev)
@@ -378,6 +379,25 @@ int node_set_memory_tier(int node, int tier)
 }
 EXPORT_SYMBOL_GPL(node_set_memory_tier);
 
+void node_get_allowed_targets(int node, nodemask_t *targets)
+{
+	/*
+	 * node_demotion[] is updated without excluding this
+	 * function from running.
+	 *
+	 * If any node is moving to a lower tier, then the modifications
+	 * in node_demotion[] are still valid for this node. If any
+	 * node is moving to a higher tier, then the moving node may be
+	 * used once more for demotion, which should be OK, so RCU should
+	 * be enough here.
+	 */
+	rcu_read_lock();
+
+	*targets = node_demotion[node].allowed;
+
+	rcu_read_unlock();
+}
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
@@ -437,8 +457,10 @@ static void __disable_all_migrate_targets(void)
 {
 	int node;
 
-	for_each_node_mask(node, node_states[N_MEMORY])
+	for_each_node_mask(node, node_states[N_MEMORY]) {
 		node_demotion[node].preferred = NODE_MASK_NONE;
+		node_demotion[node].allowed = NODE_MASK_NONE;
+	}
 }
 
 static void disable_all_migrate_targets(void)
@@ -465,7 +487,7 @@ static void establish_migration_targets(void)
 	struct demotion_nodes *nd;
 	int target = NUMA_NO_NODE, node;
 	int distance, best_distance;
-	nodemask_t used;
+	nodemask_t used, allowed = NODE_MASK_NONE;
 
 	if (!node_demotion)
 		return;
@@ -511,6 +533,29 @@ static void establish_migration_targets(void)
 			}
 		} while (1);
 	}
+	/*
+	 * Now build the allowed mask for each node, collecting the node
+	 * mask from all memory tiers below it. This allows us to fall back
+	 * demotion page allocation to a set of nodes that is closer to the
+	 * above selected preferred node.
+	 */
+	list_for_each_entry(memtier, &memory_tiers, list)
+		nodes_or(allowed, allowed, memtier->nodelist);
+	/*
+	 * Removes nodes not yet in N_MEMORY.
+	 */
+	nodes_and(allowed, node_states[N_MEMORY], allowed);
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		/*
+		 * Keep removing current tier from allowed nodes,
+		 * This will remove all nodes in current and above
+		 * memory tier from the allowed mask.
+		 */
+		nodes_andnot(allowed, allowed, memtier->nodelist);
+		for_each_node_mask(node, memtier->nodelist)
+			node_demotion[node].allowed = allowed;
+	}
 }
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a8f78277f99..d424b7af2f26 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1460,23 +1460,6 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-static struct page *alloc_demote_page(struct page *page, unsigned long node)
-{
-	struct migration_target_control mtc = {
-		/*
-		 * Allocate from 'node', or fail quickly and quietly.
-		 * When this happens, 'page' will likely just be discarded
-		 * instead of migrated.
-		 */
-		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
-			    __GFP_THISNODE  | __GFP_NOWARN |
-			    __GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = node
-	};
-
-	return alloc_migration_target(page, (unsigned long)&mtc);
-}
-
 /*
  * Take pages on @demote_list and attempt to demote them to
  * another node.  Pages which are not demoted are left on
@@ -1487,6 +1470,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
+	nodemask_t allowed_mask;
+
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
+			__GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};
 
 	if (list_empty(demote_pages))
 		return 0;
@@ -1494,10 +1490,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	if (target_nid == NUMA_NO_NODE)
 		return 0;
 
+	node_get_allowed_targets(pgdat->node_id, &allowed_mask);
+
 	/* Demotion ignores all cpuset and mempolicy settings */
-	migrate_pages(demote_pages, alloc_demote_page, NULL,
-			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
-			    &nr_succeeded);
+	migrate_pages(demote_pages, alloc_migration_target, NULL,
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+		      &nr_succeeded);
 
 	if (current_is_kswapd())
 		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v5 8/9] mm/demotion: Add documentation for memory tiering
  2022-06-03 13:42 [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (6 preceding siblings ...)
  2022-06-03 13:42 ` [PATCH v5 7/9] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
@ 2022-06-03 13:42 ` Aneesh Kumar K.V
  2022-06-03 13:42 ` [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 84+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-03 13:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K . V

From: Jagdish Gediya <jvgediya@linux.ibm.com>

All N_MEMORY nodes are divided into 3 memory tiers with rank values
MEMORY_RANK_HBM_GPU, MEMORY_RANK_DRAM and MEMORY_RANK_PMEM. By default,
all nodes are assigned to the memory tier with rank value MEMORY_RANK_DRAM.

The demotion path for all N_MEMORY nodes is prepared based on the rank
value of memory tiers.

This patch adds documentation for the memory tiering introduction, its sysfs
interfaces, and how demotion is performed based on memory tiers.

Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 Documentation/admin-guide/mm/index.rst        |   1 +
 .../admin-guide/mm/memory-tiering.rst         | 175 ++++++++++++++++++
 2 files changed, 176 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/memory-tiering.rst

diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index c21b5823f126..3f211cbca8c3 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -32,6 +32,7 @@ the Linux memory management.
    idle_page_tracking
    ksm
    memory-hotplug
+   memory-tiering
    nommu-mmap
    numa_memory_policy
    numaperf
diff --git a/Documentation/admin-guide/mm/memory-tiering.rst b/Documentation/admin-guide/mm/memory-tiering.rst
new file mode 100644
index 000000000000..afbb9591f301
--- /dev/null
+++ b/Documentation/admin-guide/mm/memory-tiering.rst
@@ -0,0 +1,175 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _admin_guide_memory_tiering:
+
+============
+Memory tiers
+============
+
+This document describes explicit memory tiering support along with
+demotion based on memory tiers.
+
+Introduction
+============
+
+Many systems have multiple types of memory devices, e.g. GPU, DRAM and
+PMEM. The memory subsystem of such systems can be called a memory
+tiering system, because the performance of the different types of
+memory is different. Memory tiers are defined based on the hardware
+capabilities of memory nodes. Each memory tier is assigned a rank
+value that determines the memory tier's position in the demotion order.
+
+The memory tier assignment of each node is independent from each
+other. Moving a node from one tier to another tier doesn't affect
+the tier assignment of any other node.
+
+Memory tiers are used to build the demotion targets for nodes: a node
+can demote its pages to any node of any lower tier.
+
+Memory tier rank
+=================
+
+Memory nodes are divided into the below 3 memory tiers, with rank values
+assigned based on their hardware characteristics:
+
+MEMORY_RANK_HBM_GPU
+MEMORY_RANK_DRAM
+MEMORY_RANK_PMEM
+
+Memory tiers initialization and (re)assignments
+===============================================
+
+By default, all nodes are assigned to the memory tier with the default rank
+DEFAULT_MEMORY_RANK, which is 1 (MEMORY_RANK_DRAM). The memory tier of a
+memory node can be modified either through sysfs or from a driver. On
+hotplug, the memory tier with the default rank is assigned to the memory
+node.
+
+Sysfs interfaces
+================
+
+Nodes belonging to a specific tier can be read from,
+/sys/devices/system/memtier/memtierN/nodelist (Read-Only)
+
+Where N is 0 - 2.
+
+Example 1:
+For a system where node 0 is a CPU + DRAM node, node 1 is an HBM node, and
+node 2 is a PMEM node, an ideal tier layout will be:
+
+$ cat /sys/devices/system/memtier/memtier0/nodelist
+1
+$ cat /sys/devices/system/memtier/memtier1/nodelist
+0
+$ cat /sys/devices/system/memtier/memtier2/nodelist
+2
+
+Example 2:
+For a system where nodes 0 & 1 are CPU + DRAM nodes, and nodes 2 & 3 are
+PMEM nodes:
+
+$ cat /sys/devices/system/memtier/memtier0/nodelist
+cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or
+directory
+$ cat /sys/devices/system/memtier/memtier1/nodelist
+0-1
+$ cat /sys/devices/system/memtier/memtier2/nodelist
+2-3
+
+Default memory tier can be read from,
+/sys/devices/system/memtier/default_tier (Read-Only)
+
+e.g.
+$ cat /sys/devices/system/memtier/default_tier
+memtier1
+
+Max memory tier can be read from,
+/sys/devices/system/memtier/max_tier (Read-Only)
+
+e.g.
+$ cat /sys/devices/system/memtier/max_tier
+3
+
+An individual node's memory tier can be read or set using,
+/sys/devices/system/node/nodeN/memtier	(Read-Write)
+
+where N = node id
+
+When this interface is written, the node is moved from its old memory tier
+to the new memory tier, and demotion targets for all N_MEMORY nodes are
+rebuilt.
+
+For example 1 mentioned above,
+$ cat /sys/devices/system/node/node0/memtier
+1
+$ cat /sys/devices/system/node/node1/memtier
+0
+$ cat /sys/devices/system/node/node2/memtier
+2
+
+Demotion
+========
+
+In a system with DRAM and persistent memory, once DRAM
+fills up, reclaim will start and some of the DRAM contents will be
+thrown out even if there is space in persistent memory.
+Consequently, allocations will, at some point, start falling over to the
+slower persistent memory.
+
+That has two nasty properties. First, the newer allocations can end up in
+the slower persistent memory. Second, reclaimed data in DRAM are just
+discarded even if there are gobs of space in persistent memory that could
+be used.
+
+Instead of a page being discarded during reclaim, it can be moved to
+persistent memory. Allowing page migration during reclaim enables
+these systems to migrate pages from fast (higher) tiers to slow (lower)
+tiers when the fast (higher) tier is under pressure.
+
+
+Enable/Disable demotion
+-----------------------
+
+By default, demotion is disabled. It can be enabled/disabled using the
+below sysfs interface,
+
+$ echo 0/1 or false/true > /sys/kernel/mm/numa/demotion_enabled
+
+preferred and allowed demotion nodes
+------------------------------------
+
+Preferred nodes for a specific N_MEMORY node are the best nodes
+from the next possible lower memory tier. Allowed nodes for any
+node are all the nodes available in all possible lower memory
+tiers.
+
+Example:
+
+For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM
+nodes,
+
+node distances:
+node   0    1    2    3
+   0  10   20   30   40
+   1  20   10   40   30
+   2  30   40   10   40
+   3  40   30   40   10
+
+memory_tiers[0] = <empty>
+memory_tiers[1] = 0-1
+memory_tiers[2] = 2-3
+
+node_demotion[0].preferred = 2
+node_demotion[0].allowed   = 2, 3
+node_demotion[1].preferred = 3
+node_demotion[1].allowed   = 3, 2
+node_demotion[2].preferred = <empty>
+node_demotion[2].allowed   = <empty>
+node_demotion[3].preferred = <empty>
+node_demotion[3].allowed   = <empty>
+
+Memory allocation for demotion
+------------------------------
+
+If a page needs to be demoted from any node, the kernel first tries
+to allocate a new page from the node's preferred node and falls back to
+the node's allowed targets in allocation fallback order.
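+
+For the example above, a page demoted from node 0 first tries an
+allocation on node 2 (its preferred node) and can then fall back to
+node 3, its remaining allowed target.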
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-06-03 13:42 [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (7 preceding siblings ...)
  2022-06-03 13:42 ` [PATCH v5 8/9] mm/demotion: Add documentation for memory tiering Aneesh Kumar K.V
@ 2022-06-03 13:42 ` Aneesh Kumar K.V
  2022-06-06  3:11   ` Ying Huang
  2022-06-06  4:53 ` [PATCH] mm/demotion: Add sysfs ABI documentation Aneesh Kumar K.V
  2022-06-08 13:57 ` [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Johannes Weiner
  10 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-03 13:42 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes, Aneesh Kumar K.V

With memory tiers support we can have memory on NUMA nodes
in the top tier from which we want to avoid promotion tracking NUMA
faults. Update node_is_toptier to work with memory tiers. To
avoid taking locks, a nodemask is maintained for all demotion
targets. All NUMA nodes are by default top tier nodes and as
we add new lower memory tiers NUMA nodes get added to the
demotion targets thereby moving them out of the top tier.
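
A toy userspace model (not part of the patch) of the node_is_toptier()
logic introduced below, with a plain bitmask standing in for the
demotion target nodemask and a hypothetical 4-node topology:

#include <stdbool.h>
#include <stdio.h>

/* Bit n set => node n is a demotion target, i.e. below the top tier. */
static unsigned int demotion_target_mask;

static bool node_is_toptier(int node)
{
	return !(demotion_target_mask & (1u << node));
}

int main(void)
{
	int node;

	/* Hypothetical: PMEM nodes 2-3 were placed in a lower tier. */
	demotion_target_mask = 0xc;
	for (node = 0; node < 4; node++)
		printf("node %d is_toptier: %d\n", node,
		       node_is_toptier(node));
	return 0;	/* nodes 0-1 -> 1, nodes 2-3 -> 0 */
}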

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h | 16 ++++++++++++++++
 include/linux/node.h         |  5 -----
 mm/huge_memory.c             |  1 +
 mm/memory-tiers.c            | 10 ++++++++++
 mm/migrate.c                 |  1 +
 mm/mprotect.c                |  1 +
 6 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index cd6e71f702ad..32e0e6fabf02 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -16,12 +16,23 @@
 #define MAX_MEMORY_TIERS  3
 
 extern bool numa_demotion_enabled;
+extern nodemask_t demotion_target_mask;
 int next_demotion_node(int node);
 void node_remove_from_memory_tier(int node);
 int node_get_memory_tier_id(int node);
 int node_set_memory_tier(int node, int tier);
 int node_reset_memory_tier(int node, int tier);
 void node_get_allowed_targets(int node, nodemask_t *targets);
+
+/*
+ * By default all nodes are top tier. As we create new memory tiers
+ * below the top tier, their nodes get added to demotion_target_mask
+ * and thereby move out of the top tier.
+ */
+static inline bool node_is_toptier(int node)
+{
+	return !node_isset(node, demotion_target_mask);
+}
+
 #else
 #define numa_demotion_enabled	false
 static inline int next_demotion_node(int node)
@@ -33,6 +44,11 @@ static inline void node_get_allowed_targets(int node, nodemask_t *targets)
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+	return true;
+}
 #endif	/* CONFIG_TIERED_MEMORY */
 
 #endif
diff --git a/include/linux/node.h b/include/linux/node.h
index 40d641a8bfb0..9ec680dd607f 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
 
 #define to_node(device) container_of(device, struct node, dev)
 
-static inline bool node_is_toptier(int node)
-{
-	return node_state(node, N_CPU);
-}
-
 #endif /* _LINUX_NODE_H_ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a77c78a2b6b5..294873d4be2b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -35,6 +35,7 @@
 #include <linux/numa.h>
 #include <linux/page_owner.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 592d939ec28d..df8e9910165a 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -31,6 +31,7 @@ static struct bus_type memory_tier_subsys = {
 static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+nodemask_t demotion_target_mask;
 
 /*
  * node_demotion[] examples:
@@ -541,6 +542,15 @@ static void establish_migration_targets(void)
 	 */
 	list_for_each_entry(memtier, &memory_tiers, list)
 		nodes_or(allowed, allowed, memtier->nodelist);
+	/*
+	 * Add nodes to demotion target mask so that we can find
+	 * top tier easily.
+	 */
+	memtier = list_first_entry(&memory_tiers, struct memory_tier, list);
+	if (memtier)
+		nodes_andnot(demotion_target_mask, allowed, memtier->nodelist);
+	else
+		demotion_target_mask = NODE_MASK_NONE;
 	/*
 	 * Removes nodes not yet in N_MEMORY.
 	 */
diff --git a/mm/migrate.c b/mm/migrate.c
index 0b554625a219..78615c48fc0f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@
 #include <linux/memory.h>
 #include <linux/random.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..92a2fc0fa88b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -31,6 +31,7 @@
 #include <linux/pgtable.h>
 #include <linux/sched/sysctl.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/memory-tiers.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-06-03 13:42 ` [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
@ 2022-06-06  3:11   ` Ying Huang
  2022-06-06  3:52     ` Aneesh Kumar K V
  0 siblings, 1 reply; 84+ messages in thread
From: Ying Huang @ 2022-06-06  3:11 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> With memory tiers support we can have memory on NUMA nodes
> in the top tier from which we want to avoid promotion tracking NUMA
> faults. Update node_is_toptier to work with memory tiers. To
> avoid taking locks, a nodemask is maintained for all demotion
> targets. All NUMA nodes are by default top tier nodes and as
> we add new lower memory tiers NUMA nodes get added to the
> demotion targets thereby moving them out of the top tier.

Check the usage of node_is_toptier(),

- migrate_misplaced_page()
  node_is_toptier() is used to check whether migration is a promotion. 
We can avoid using it.  Just compare the ranks of the nodes.

- change_pte_range() and change_huge_pmd()
  node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
for promotion.  So I think we should change the name to node_is_fast()
as follows,

static inline bool node_is_fast(int node)
{
	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
}

And, as above, I suggest to add memory tier ID and rank to struct
pglist_data directly.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-06-06  3:11   ` Ying Huang
@ 2022-06-06  3:52     ` Aneesh Kumar K V
  2022-06-06  7:24       ` Ying Huang
  0 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-06  3:52 UTC (permalink / raw)
  To: Ying Huang, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/6/22 8:41 AM, Ying Huang wrote:
> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>> With memory tiers support we can have memory on NUMA nodes
>> in the top tier from which we want to avoid promotion tracking NUMA
>> faults. Update node_is_toptier to work with memory tiers. To
>> avoid taking locks, a nodemask is maintained for all demotion
>> targets. All NUMA nodes are by default top tier nodes and as
>> we add new lower memory tiers NUMA nodes get added to the
>> demotion targets thereby moving them out of the top tier.
> 
> Check the usage of node_is_toptier(),
> 
> - migrate_misplaced_page()
>    node_is_toptier() is used to check whether migration is a promotion.
> We can avoid to use it.  Just compare the rank of the nodes.
> 
> - change_pte_range() and change_huge_pmd()
>    node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
> for promotion.  So I think we should change the name to node_is_fast()
> as follows,
> 
> static inline bool node_is_fast(int node)
> {
> 	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
> }
> 

But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other
patches, the absolute value of rank doesn't carry any meaning. It is only
the relative value w.r.t other memory tiers that decides whether it is
fast or not. Agreed, by default memory tiers get built with
MEMORY_RANK_DRAM, but userspace can change the rank value of 'memtier1'.
Hence determining that a node consists of fast memory is essentially
figuring out whether the node is in the topmost tier in the memory
hierarchy, and not just whether the memory tier rank value is
>= MEMORY_RANK_DRAM.

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH] mm/demotion: Add sysfs ABI documentation
  2022-06-03 13:42 [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (8 preceding siblings ...)
  2022-06-03 13:42 ` [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
@ 2022-06-06  4:53 ` Aneesh Kumar K.V
  2022-06-08 13:57 ` [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Johannes Weiner
  10 siblings, 0 replies; 84+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-06  4:53 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes


From dd986c5aab4f6ccf41cc8d2dde5d9702a17adb6f Mon Sep 17 00:00:00 2001
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Date: Mon, 6 Jun 2022 10:14:17 +0530
Subject: [PATCH] mm/demotion: Add sysfs ABI documentation

Add sysfs ABI documentation.

Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 .../ABI/testing/sysfs-kernel-mm-memory-tiers  | 81 +++++++++++++++++++
 1 file changed, 81 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers b/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
new file mode 100644
index 000000000000..41b0d1756ddb
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
@@ -0,0 +1,81 @@
+What:		/sys/devices/system/memtier/
+Date:		Jun 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for tiered memory
+
+		This is the directory containing the information about memory tiers.
+
+		Each memory tier has its own subdirectory.
+
+		The order of memory tiers is determined by their rank values, not by
+		their memtier device names.  A higher rank value means a higher tier.
+
+What:		/sys/devices/system/memtier/default_tier
+Date:		Jun 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for tiered memory
+
+		The default memory tier to which memory would get added via hotplug
+		if the NUMA node is not part of any memory tier.
+
+What:		/sys/devices/system/memtier/max_tier
+Date:		Jun 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for tiered memory
+
+		The max memory tier device ID we can create. Users can create memory
+		tiers in range [0 - max_tier)
+
+What:		/sys/devices/system/memtier/memtierN/
+Date:		Jun 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for tiered memory
+
+		This is the directory containing the information about a particular
+		memory tier, memtierN, where N is the memtier device ID (e.g. 0, 1).
+
+		The memtier device ID number itself is just an identifier and has no
+		special meaning, i.e. memtier device ID numbers do not determine the
+		order of memory tiers.
+
+What:		/sys/devices/system/memtier/memtierN/rank
+Date:		Jun 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for tiered memory
+
+
+		When read, list the "rank" value associated with memtierN.
+
+		"Rank" is an opaque value. Its absolute value doesn't have any
+		special meaning. But the rank values of different memtiers can be
+		compared with each other to determine the memory tier order.
+
+		For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
+		their rank values are 100, 10, 50, then the memory tier order is:
+		memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
+		and memtier1 is the lowest tier.
+
+		The rank value of each memtier should be unique.
+
+What:		/sys/devices/system/memtier/memtierN/nodelist
+Date:		Jun 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for tiered memory
+
+
+		When read, list the memory nodes in the specified tier.
+
+What:		/sys/devices/system/node/nodeN/memtier
+Date:		Jun 2022
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for tiered memory
+
+		When read, list the device ID of the memory tier that the node belongs
+		to.  Its value is empty for a CPU-only NUMA node.
+
+		When written, the kernel moves the node into the specified memory
+		tier if the move is allowed. The tier assignments of all other
+		nodes are not affected.
+
+		When written with the special string "none", the specific node is
+		removed from participating in memory demotion.
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-06-06  3:52     ` Aneesh Kumar K V
@ 2022-06-06  7:24       ` Ying Huang
  2022-06-06  8:33         ` Aneesh Kumar K V
  0 siblings, 1 reply; 84+ messages in thread
From: Ying Huang @ 2022-06-06  7:24 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Mon, 2022-06-06 at 09:22 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 8:41 AM, Ying Huang wrote:
> > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> > > With memory tiers support we can have memory on NUMA nodes
> > > in the top tier from which we want to avoid promotion tracking NUMA
> > > faults. Update node_is_toptier to work with memory tiers. To
> > > avoid taking locks, a nodemask is maintained for all demotion
> > > targets. All NUMA nodes are by default top tier nodes and as
> > > we add new lower memory tiers NUMA nodes get added to the
> > > demotion targets thereby moving them out of the top tier.
> > 
> > Check the usage of node_is_toptier(),
> > 
> > - migrate_misplaced_page()
> >    node_is_toptier() is used to check whether migration is a promotion.
> > We can avoid to use it.  Just compare the rank of the nodes.
> > 
> > - change_pte_range() and change_huge_pmd()
> >    node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
> > for promotion.  So I think we should change the name to node_is_fast()
> > as follows,
> > 
> > static inline bool node_is_fast(int node)
> > {
> > 	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
> > }
> > 
> 
> But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other 
> patches, absolute value of rank doesn't carry any meaning. It is only
> the relative value w.r.t other memory tiers that decide whether it is 
> fast or not. Agreed by default memory tiers get built with 
> MEMORY_RANK_DRAM. But userspace can change the rank value of 'memtier1' 
> Hence to determine a node is consisting of fast memory is essentially 
> figuring out whether node is the top most tier in memory hierarchy and 
> not just the memory tier rank value is >= MEMORY_RANK_DRAM?

In a system with 3 tiers,

HBM	0
DRAM	1
PMEM	2

In your implementation, only HBM will be considered fast.  But what we
need is to consider both HBM and DRAM fast.  Because we use NUMA
balancing to promote PMEM pages to DRAM.  It's unnecessary to scan HBM
and DRAM pages for that.  And there're no requirements to promote DRAM
pages to HBM with NUMA balancing.

I can understand that the memory tiers are more dynamic now.  For
requirements of NUMA balancing, we need the lowest memory tier (rank)
where there's at least one node with CPU.  The nodes in it and the
higher tiers will be considered fast. 


Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-06-06  7:24       ` Ying Huang
@ 2022-06-06  8:33         ` Aneesh Kumar K V
  2022-06-08  7:26           ` Ying Huang
  0 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-06  8:33 UTC (permalink / raw)
  To: Ying Huang, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/6/22 12:54 PM, Ying Huang wrote:
> On Mon, 2022-06-06 at 09:22 +0530, Aneesh Kumar K V wrote:
>> On 6/6/22 8:41 AM, Ying Huang wrote:
>>> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>>>> With memory tiers support we can have memory on NUMA nodes
>>>> in the top tier from which we want to avoid promotion tracking NUMA
>>>> faults. Update node_is_toptier to work with memory tiers. To
>>>> avoid taking locks, a nodemask is maintained for all demotion
>>>> targets. All NUMA nodes are by default top tier nodes and as
>>>> we add new lower memory tiers NUMA nodes get added to the
>>>> demotion targets thereby moving them out of the top tier.
>>>
>>> Check the usage of node_is_toptier(),
>>>
>>> - migrate_misplaced_page()
>>>     node_is_toptier() is used to check whether migration is a promotion.
>>> We can avoid to use it.  Just compare the rank of the nodes.
>>>
>>> - change_pte_range() and change_huge_pmd()
>>>     node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
>>> for promotion.  So I think we should change the name to node_is_fast()
>>> as follows,
>>>
>>> static inline bool node_is_fast(int node)
>>> {
>>> 	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
>>> }
>>>
>>
>> But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other
>> patches, absolute value of rank doesn't carry any meaning. It is only
>> the relative value w.r.t other memory tiers that decide whether it is
>> fast or not. Agreed by default memory tiers get built with
>> MEMORY_RANK_DRAM. But userspace can change the rank value of 'memtier1'
>> Hence to determine a node is consisting of fast memory is essentially
>> figuring out whether node is the top most tier in memory hierarchy and
>> not just the memory tier rank value is >= MEMORY_RANK_DRAM?
> 
> In a system with 3 tiers,
> 
> HBM	0
> DRAM	1
> PMEM	2
> 
> In your implementation, only HBM will be considered fast.  But what we
> need is to consider both HBM and DRAM fast.  Because we use NUMA
> balancing to promote PMEM pages to DRAM.  It's unnecessary to scan HBM
> and DRAM pages for that.  And there're no requirements to promote DRAM
> pages to HBM with NUMA balancing.
> 
> I can understand that the memory tiers are more dynamic now.  For
> requirements of NUMA balancing, we need the lowest memory tier (rank)
> where there's at least one node with CPU.  The nodes in it and the
> higher tiers will be considered fast.
> 

is this good (not tested)?
/*
 * Build the allowed promotion mask. We want to skip promotion
 * from a memory tier if any node which is part of that memory
 * tier has CPUs. Walking the tiers from the lowest up, once we
 * detect such a memory tier, we consider that tier and every
 * tier above it as top tier, from which promotion is not allowed.
 */
list_for_each_entry_reverse(memtier, &memory_tiers, list) {
	nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
	if (nodes_empty(allowed))
		/* no CPUs in this tier: its nodes stay promotion sources */
		nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
	else
		break;
}

and then

static inline bool node_is_toptier(int node)
{
	return !node_isset(node, promotion_mask);
}


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 3/9] mm/demotion: Move memory demotion related code
  2022-06-03 13:42 ` [PATCH v5 3/9] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
@ 2022-06-06 13:39   ` Bharata B Rao
  0 siblings, 0 replies; 84+ messages in thread
From: Bharata B Rao @ 2022-06-06 13:39 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/3/2022 7:12 PM, Aneesh Kumar K.V wrote:
> This move memory demotion related code to mm/demotion.c.

*mm/memory-tiers.c

> No functional change in this patch.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-03 13:42 ` [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
@ 2022-06-07 18:43   ` Tim Chen
  2022-06-07 20:18     ` Wei Xu
                       ` (2 more replies)
  2022-06-07 21:32   ` Yang Shi
  2022-06-08 14:11   ` Johannes Weiner
  2 siblings, 3 replies; 84+ messages in thread
From: Tim Chen @ 2022-06-07 18:43 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> 
> 
> The nodes which are part of a specific memory tier can be listed
> via
> /sys/devices/system/memtier/memtierN/nodelist
> 
> "Rank" is an opaque value. Its absolute value doesn't have any
> special meaning. But the rank values of different memtiers can be
> compared with each other to determine the memory tier order.
> 
> For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
> their rank values are 300, 200, 100, then the memory tier order is:
> memtier0 -> memtier2 -> memtier1, 

Why is memtier2 (rank 100) higher than memtier1 (rank 200)?  Seems like
the order should be memtier0 -> memtier1 -> memtier2?
                    (rank 300)  (rank 200)  (rank 100)

> where memtier0 is the highest tier
> and memtier1 is the lowest tier.

I think memtier2 is the lowest as it has the lowest rank value.
> 
> The rank value of each memtier should be unique.
> 
> 
> +
> +static void memory_tier_device_release(struct device *dev)
> +{
> +	struct memory_tier *tier = to_memory_tier(dev);
> +

Do we need some ref counts on memory_tier?
If there is another device still using the same memtier,
the free below could cause problems.

> +	kfree(tier);
> +}
> +
> 
...
> +static struct memory_tier *register_memory_tier(unsigned int tier)
> +{
> +	int error;
> +	struct memory_tier *memtier;
> +
> +	if (tier >= MAX_MEMORY_TIERS)
> +		return NULL;
> +
> +	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> +	if (!memtier)
> +		return NULL;
> +
> +	memtier->dev.id = tier;
> +	memtier->rank = get_rank_from_tier(tier);
> +	memtier->dev.bus = &memory_tier_subsys;
> +	memtier->dev.release = memory_tier_device_release;
> +	memtier->dev.groups = memory_tier_dev_groups;
> +

Should you take the mem_tier_lock before you insert to
memtier-list?

> +	insert_memory_tier(memtier);
> +
> +	error = device_register(&memtier->dev);
> +	if (error) {
> +		list_del(&memtier->list);
> +		put_device(&memtier->dev);
> +		return NULL;
> +	}
> +	return memtier;
> +}
> +
> +__maybe_unused // temporay to prevent warnings during bisects
> +static void unregister_memory_tier(struct memory_tier *memtier)
> +{

I think we should take mem_tier_lock before modifying memtier->list.

> +	list_del(&memtier->list);
> +	device_unregister(&memtier->dev);
> +}
> +
> 

Thanks.

Tim


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 2/9] mm/demotion: Expose per node memory tier to sysfs
  2022-06-03 13:42 ` [PATCH v5 2/9] mm/demotion: Expose per node memory tier to sysfs Aneesh Kumar K.V
@ 2022-06-07 20:15   ` Tim Chen
  2022-06-08  4:55     ` Aneesh Kumar K V
  0 siblings, 1 reply; 84+ messages in thread
From: Tim Chen @ 2022-06-07 20:15 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> 
>  
> +static struct memory_tier *__node_get_memory_tier(int node)
> +{
> +	struct memory_tier *memtier;
> +
> +	list_for_each_entry(memtier, &memory_tiers, list) {

We may need to map a node to its mem_tier quite often, if we need
to account memory usage at the tier level.  It will be more efficient
to have a pointer from the node (pgdat) to the memtier rather
than doing a search through the list.


> +		if (node_isset(node, memtier->nodelist))
> +			return memtier;
> +	}
> +	return NULL;
> +}
> +
> 

Tim


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-07 18:43   ` Tim Chen
@ 2022-06-07 20:18     ` Wei Xu
  2022-06-08  4:30     ` Aneesh Kumar K V
  2022-06-08  4:37     ` Aneesh Kumar K V
  2 siblings, 0 replies; 84+ messages in thread
From: Wei Xu @ 2022-06-07 20:18 UTC (permalink / raw)
  To: Tim Chen
  Cc: Aneesh Kumar K.V, Linux MM, Andrew Morton, Huang Ying,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On Tue, Jun 7, 2022 at 11:43 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> >
> >
> > The nodes which are part of a specific memory tier can be listed
> > via
> > /sys/devices/system/memtier/memtierN/nodelist
> >
> > "Rank" is an opaque value. Its absolute value doesn't have any
> > special meaning. But the rank values of different memtiers can be
> > compared with each other to determine the memory tier order.
> >
> > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
> > their rank values are 300, 200, 100, then the memory tier order is:
> > memtier0 -> memtier2 -> memtier1,
>
> Why is memtier2 (rank 100) higher than memtier1 (rank 200)?  Seems like
> the order should be memtier0 -> memtier1 -> memtier2?
>                     (rank 300)  (rank 200)  (rank 100)

I think this is a copy-and-modify typo from my original memory tiering
kernel interface RFC (v4,
https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com/T/),
where the rank values are 100, 10, 50 (i.e. the rank of memtier2 is
higher than memtier1).

> > where memtier0 is the highest tier
> > and memtier1 is the lowest tier.
>
> I think memtier2 is the lowest as it has the lowest rank value.
> >
> > The rank value of each memtier should be unique.
> >
> >
> > +
> > +static void memory_tier_device_release(struct device *dev)
> > +{
> > +     struct memory_tier *tier = to_memory_tier(dev);
> > +
>
> Do we need some ref counts on memory_tier?
> If there is another device still using the same memtier,
> free below could cause problem.
>
> > +     kfree(tier);
> > +}
> > +
> >
> ...
> > +static struct memory_tier *register_memory_tier(unsigned int tier)
> > +{
> > +     int error;
> > +     struct memory_tier *memtier;
> > +
> > +     if (tier >= MAX_MEMORY_TIERS)
> > +             return NULL;
> > +
> > +     memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> > +     if (!memtier)
> > +             return NULL;
> > +
> > +     memtier->dev.id = tier;
> > +     memtier->rank = get_rank_from_tier(tier);
> > +     memtier->dev.bus = &memory_tier_subsys;
> > +     memtier->dev.release = memory_tier_device_release;
> > +     memtier->dev.groups = memory_tier_dev_groups;
> > +
>
> Should you take the mem_tier_lock before you insert to
> memtier-list?
>
> > +     insert_memory_tier(memtier);
> > +
> > +     error = device_register(&memtier->dev);
> > +     if (error) {
> > +             list_del(&memtier->list);
> > +             put_device(&memtier->dev);
> > +             return NULL;
> > +     }
> > +     return memtier;
> > +}
> > +
> > +__maybe_unused // temporay to prevent warnings during bisects
> > +static void unregister_memory_tier(struct memory_tier *memtier)
> > +{
>
> I think we should take mem_tier_lock before modifying memtier->list.
>
> > +     list_del(&memtier->list);
> > +     device_unregister(&memtier->dev);
> > +}
> > +
> >
>
> Thanks.
>
> Tim
>
>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-03 13:42 ` [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
  2022-06-07 18:43   ` Tim Chen
@ 2022-06-07 21:32   ` Yang Shi
  2022-06-08  1:34     ` Ying Huang
  2022-06-08  4:58     ` Aneesh Kumar K V
  2022-06-08 14:11   ` Johannes Weiner
  2 siblings, 2 replies; 84+ messages in thread
From: Yang Shi @ 2022-06-07 21:32 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Linux MM, Andrew Morton, Wei Xu, Huang Ying, Greg Thelen,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created
> during the kernel initialization and updated when a NUMA node is
> hot-added or hot-removed.  The current implementation puts all
> nodes with CPU into the top tier, and builds the tier hierarchy
> tier-by-tier by establishing the per-node demotion targets based
> on the distances between nodes.
>
> This current memory tier kernel interface needs to be improved for
> several important use cases,
>
> The current tier initialization code always initializes
> each memory-only NUMA node into a lower tier.  But a memory-only
> NUMA node may have a high performance memory device (e.g. a DRAM
> device attached via CXL.mem or a DRAM-backed memory-only node on
> a virtual machine) and should be put into a higher tier.
>
> The current tier hierarchy always puts CPU nodes into the top
> tier. But on a system with HBM or GPU devices, the
> memory-only NUMA nodes mapping these devices should be in the
> top tier, and DRAM nodes with CPUs are better to be placed into the
> next lower tier.
>
> With the current kernel, a higher tier node can only be demoted to
> selected nodes on the next lower tier as defined by the demotion
> path, not any other node from any lower tier.  This strict,
> hard-coded demotion order does not work in all use cases (e.g. some
> use cases may want to allow cross-socket demotion to another node
> in the same demotion tier as a fallback when the preferred demotion
> node is out of space).  This demotion order is also inconsistent
> with the page allocation fallback order when all the nodes in a
> higher tier are out of space: The page allocation can fall back to
> any node from any lower tier, whereas the demotion order doesn't
> allow that.
>
> The current kernel also doesn't provide any interfaces for the
> userspace to learn about the memory tier hierarchy in order to
> optimize its memory allocations.
>
> This patch series addresses the above by defining memory tiers explicitly.
>
> This patch introduces explicit memory tiers with ranks. The rank
> value of a memory tier is used to derive the demotion order between
> NUMA nodes. The memory tiers present in a system can be found at
>
> /sys/devices/system/memtier/memtierN/
>
> The nodes which are part of a specific memory tier can be listed
> via
> /sys/devices/system/memtier/memtierN/nodelist
>
> "Rank" is an opaque value. Its absolute value doesn't have any
> special meaning. But the rank values of different memtiers can be
> compared with each other to determine the memory tier order.
>
> For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
> their rank values are 300, 200, 100, then the memory tier order is:
> memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
> and memtier1 is the lowest tier.
>
> The rank value of each memtier should be unique.
>
> A higher rank memory tier appears earlier in the demotion order
> than a lower rank memory tier, i.e. during reclaim we prefer to
> demote pages to a node in a higher rank memory tier over a node
> in a lower rank memory tier.
>
> For now we are not supporting a dynamic number of memory tiers.
> But a future series adding that is possible. Currently the
> number of tiers supported is limited to MAX_MEMORY_TIERS(3).
> When doing memory hotplug, if not added to a memory tier, the NUMA
> node gets added to DEFAULT_MEMORY_TIER(1).
>
> This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
>
> [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>
> Suggested-by: Wei Xu <weixugc@google.com>
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h |  20 ++++
>  mm/Kconfig                   |  11 ++
>  mm/Makefile                  |   1 +
>  mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
>  4 files changed, 220 insertions(+)
>  create mode 100644 include/linux/memory-tiers.h
>  create mode 100644 mm/memory-tiers.c
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> new file mode 100644
> index 000000000000..e17f6b4ee177
> --- /dev/null
> +++ b/include/linux/memory-tiers.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +#ifdef CONFIG_TIERED_MEMORY
> +
> +#define MEMORY_TIER_HBM_GPU    0
> +#define MEMORY_TIER_DRAM       1
> +#define MEMORY_TIER_PMEM       2
> +
> +#define MEMORY_RANK_HBM_GPU    300
> +#define MEMORY_RANK_DRAM       200
> +#define MEMORY_RANK_PMEM       100
> +
> +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
> +#define MAX_MEMORY_TIERS  3
> +
> +#endif /* CONFIG_TIERED_MEMORY */
> +
> +#endif
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 169e64192e48..08a3d330740b 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
>  config ARCH_ENABLE_THP_MIGRATION
>         bool
>
> +config TIERED_MEMORY
> +       bool "Support for explicit memory tiers"
> +       def_bool n
> +       depends on MIGRATION && NUMA
> +       help
> +         Support to split nodes into memory tiers explicitly and
> +         to demote pages on reclaim to lower tiers. This option
> +         also exposes sysfs interface to read nodes available in
> +         specific tier and to move specific node among different
> +         possible tiers.

IMHO we should not need a new kernel config. If tiering is not present
then there is just one tier on the system. And tiering is a kind of
hardware configuration; the information could be shown regardless of
whether demotion/promotion is supported/enabled or not.

> +
>  config HUGETLB_PAGE_SIZE_VARIABLE
>         def_bool n
>         help
> diff --git a/mm/Makefile b/mm/Makefile
> index 6f9ffa968a1a..482557fbc9d1 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>  obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMTEST)          += memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> new file mode 100644
> index 000000000000..7de18d94a08d
> --- /dev/null
> +++ b/mm/memory-tiers.c
> @@ -0,0 +1,188 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/types.h>
> +#include <linux/device.h>
> +#include <linux/nodemask.h>
> +#include <linux/slab.h>
> +#include <linux/memory-tiers.h>
> +
> +struct memory_tier {
> +       struct list_head list;
> +       struct device dev;
> +       nodemask_t nodelist;
> +       int rank;
> +};
> +
> +#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
> +
> +static struct bus_type memory_tier_subsys = {
> +       .name = "memtier",
> +       .dev_name = "memtier",
> +};
> +
> +static DEFINE_MUTEX(memory_tier_lock);
> +static LIST_HEAD(memory_tiers);
> +
> +
> +static ssize_t nodelist_show(struct device *dev,
> +                            struct device_attribute *attr, char *buf)
> +{
> +       struct memory_tier *memtier = to_memory_tier(dev);
> +
> +       return sysfs_emit(buf, "%*pbl\n",
> +                         nodemask_pr_args(&memtier->nodelist));
> +}
> +static DEVICE_ATTR_RO(nodelist);
> +
> +static ssize_t rank_show(struct device *dev,
> +                        struct device_attribute *attr, char *buf)
> +{
> +       struct memory_tier *memtier = to_memory_tier(dev);
> +
> +       return sysfs_emit(buf, "%d\n", memtier->rank);
> +}
> +static DEVICE_ATTR_RO(rank);
> +
> +static struct attribute *memory_tier_dev_attrs[] = {
> +       &dev_attr_nodelist.attr,
> +       &dev_attr_rank.attr,
> +       NULL
> +};
> +
> +static const struct attribute_group memory_tier_dev_group = {
> +       .attrs = memory_tier_dev_attrs,
> +};
> +
> +static const struct attribute_group *memory_tier_dev_groups[] = {
> +       &memory_tier_dev_group,
> +       NULL
> +};
> +
> +static void memory_tier_device_release(struct device *dev)
> +{
> +       struct memory_tier *tier = to_memory_tier(dev);
> +
> +       kfree(tier);
> +}
> +
> +/*
> + * Keep it simple by having  direct mapping between
> + * tier index and rank value.
> + */
> +static inline int get_rank_from_tier(unsigned int tier)
> +{
> +       switch (tier) {
> +       case MEMORY_TIER_HBM_GPU:
> +               return MEMORY_RANK_HBM_GPU;
> +       case MEMORY_TIER_DRAM:
> +               return MEMORY_RANK_DRAM;
> +       case MEMORY_TIER_PMEM:
> +               return MEMORY_RANK_PMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static void insert_memory_tier(struct memory_tier *memtier)
> +{
> +       struct list_head *ent;
> +       struct memory_tier *tmp_memtier;
> +
> +       list_for_each(ent, &memory_tiers) {
> +               tmp_memtier = list_entry(ent, struct memory_tier, list);
> +               if (tmp_memtier->rank < memtier->rank) {
> +                       list_add_tail(&memtier->list, ent);
> +                       return;
> +               }
> +       }
> +       list_add_tail(&memtier->list, &memory_tiers);
> +}
> +
> +static struct memory_tier *register_memory_tier(unsigned int tier)
> +{
> +       int error;
> +       struct memory_tier *memtier;
> +
> +       if (tier >= MAX_MEMORY_TIERS)
> +               return NULL;
> +
> +       memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> +       if (!memtier)
> +               return NULL;
> +
> +       memtier->dev.id = tier;
> +       memtier->rank = get_rank_from_tier(tier);
> +       memtier->dev.bus = &memory_tier_subsys;
> +       memtier->dev.release = memory_tier_device_release;
> +       memtier->dev.groups = memory_tier_dev_groups;
> +
> +       insert_memory_tier(memtier);
> +
> +       error = device_register(&memtier->dev);
> +       if (error) {
> +               list_del(&memtier->list);
> +               put_device(&memtier->dev);
> +               return NULL;
> +       }
> +       return memtier;
> +}
> +
> +__maybe_unused // temporary to prevent warnings during bisects
> +static void unregister_memory_tier(struct memory_tier *memtier)
> +{
> +       list_del(&memtier->list);
> +       device_unregister(&memtier->dev);
> +}
> +
> +static ssize_t
> +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +       return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
> +}
> +static DEVICE_ATTR_RO(max_tier);
> +
> +static ssize_t
> +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +       return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER);
> +}
> +static DEVICE_ATTR_RO(default_tier);
> +
> +static struct attribute *memory_tier_attrs[] = {
> +       &dev_attr_max_tier.attr,
> +       &dev_attr_default_tier.attr,
> +       NULL
> +};
> +
> +static const struct attribute_group memory_tier_attr_group = {
> +       .attrs = memory_tier_attrs,
> +};
> +
> +static const struct attribute_group *memory_tier_attr_groups[] = {
> +       &memory_tier_attr_group,
> +       NULL,
> +};
> +
> +static int __init memory_tier_init(void)
> +{
> +       int ret;
> +       struct memory_tier *memtier;
> +
> +       ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> +       if (ret)
> +               panic("%s() failed to register subsystem: %d\n", __func__, ret);
> +
> +       /*
> +        * Register only default memory tier to hide all empty
> +        * memory tier from sysfs.
> +        */
> +       memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
> +       if (!memtier)
> +               panic("%s() failed to register memory tier: %d\n", __func__, ret);
> +
> +       /* CPU only nodes are not part of memory tiers. */
> +       memtier->nodelist = node_states[N_MEMORY];
> +
> +       return 0;
> +}
> +subsys_initcall(memory_tier_init);
> +
> --
> 2.36.1
>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-06-03 13:42 ` [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
@ 2022-06-07 22:51   ` Tim Chen
  2022-06-08  5:02     ` Aneesh Kumar K V
  2022-06-08  6:52     ` Ying Huang
  2022-06-08  6:50   ` Ying Huang
  2022-06-08  8:00   ` Ying Huang
  2 siblings, 2 replies; 84+ messages in thread
From: Tim Chen @ 2022-06-07 22:51 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> 
> +int next_demotion_node(int node)
> +{
> +	struct demotion_nodes *nd;
> +	int target, nnodes, i;
> +
> +	if (!node_demotion)
> +		return NUMA_NO_NODE;
> +
> +	nd = &node_demotion[node];
> +
> +	/*
> +	 * node_demotion[] is updated without excluding this
> +	 * function from running.
> +	 *
> +	 * Make sure to use RCU over entire code blocks if
> +	 * node_demotion[] reads need to be consistent.
> +	 */
> +	rcu_read_lock();
> +
> +	nnodes = nodes_weight(nd->preferred);
> +	if (!nnodes) {
> +		rcu_read_unlock();
> +		return NUMA_NO_NODE;
> +	}
> +
> +	/*
> +	 * If there are multiple target nodes, just select one
> +	 * target node randomly.
> +	 *
> +	 * In addition, we can also use round-robin to select
> +	 * target node, but we should introduce another variable
> +	 * for node_demotion[] to record last selected target node,
> +	 * that may cause cache ping-pong due to the changing of
> +	 * last target node. Or introducing per-cpu data to avoid
> +	 * caching issue, which seems more complicated. So selecting
> +	 * target node randomly seems better until now.
> +	 */
> +	nnodes = get_random_int() % nnodes;
> +	target = first_node(nd->preferred);
> +	for (i = 0; i < nnodes; i++)
> +		target = next_node(target, nd->preferred);

We can simplify the above 4 lines.

	target = node_random(nd->preferred);

There's still a loop overhead though :(
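
For reference, node_random() amounts to roughly the following (a
sketch from memory of the helper, not necessarily the exact upstream
code):

	int node_random(const nodemask_t *maskp)
	{
		int w, bit = NUMA_NO_NODE;

		/* pick the n-th set bit, with n chosen at random */
		w = nodes_weight(*maskp);
		if (w)
			bit = bitmap_ord_to_pos(maskp->bits,
						get_random_int() % w,
						MAX_NUMNODES);
		return bit;
	}

so it still walks the bitmap once, but it keeps the call site down to
a single line.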

> +
> +	rcu_read_unlock();
> +
> +	return target;
> +}
> +
> 
> + */
> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> +						 unsigned long action, void *_arg)
> +{
> +	struct memory_notify *arg = _arg;
> +
> +	/*
> +	 * Only update the node migration order when a node is
> +	 * changing status, like online->offline.
> +	 */
> +	if (arg->status_change_nid < 0)
> +		return notifier_from_errno(0);
> +
> +	switch (action) {
> +	case MEM_OFFLINE:
> +		/*
> +		 * We are moving out of N_MEMORY. Keep the node
> +		 * in its memory tier so that when memory is brought
> +		 * back online, it appears in the right memory tier.
> +		 * We still need to rebuild the demotion order.
> +		 */
> +		mutex_lock(&memory_tier_lock);
> +		establish_migration_targets();
> +		mutex_unlock(&memory_tier_lock);
> +		break;
> +	case MEM_ONLINE:
> +		/*
> +		 * We ignore the error here. If the node already has a tier
> +		 * registered, we will continue to use that for the new memory
> +		 * we are adding here.
> +		 */
> +		node_set_memory_tier(arg->status_change_nid, DEFAULT_MEMORY_TIER);

Should establish_migration_targets() be run here? Otherwise what are the
demotion targets for this newly onlined node?
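
Something along these lines, mirroring the MEM_OFFLINE path above
(just a sketch of the idea, ignoring whether node_set_memory_tier()
already takes memory_tier_lock internally):

	case MEM_ONLINE:
		node_set_memory_tier(arg->status_change_nid,
				     DEFAULT_MEMORY_TIER);
		/* rebuild the demotion order for the newly onlined node */
		mutex_lock(&memory_tier_lock);
		establish_migration_targets();
		mutex_unlock(&memory_tier_lock);
		break;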

> +		break;
> +	}
> +
> +	return notifier_from_errno(0);
> +}
> +

Tim


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers
  2022-06-03 13:42 ` [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers Aneesh Kumar K.V
@ 2022-06-07 23:40   ` Tim Chen
  2022-06-08  6:59   ` Ying Huang
  1 sibling, 0 replies; 84+ messages in thread
From: Tim Chen @ 2022-06-07 23:40 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> This patch adds the special string "none" as a supported memtier value
> that we can use to remove a specific node from being used as a demotion target.

Also, such a node will not participate in promotion.  That is, hot memory in it will
not be promoted to other nodes.

> 
> For ex:
> :/sys/devices/system/node/node1# cat memtier
> 1
> :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
> 1-3
> :/sys/devices/system/node/node1# echo none > memtier
> :/sys/devices/system/node/node1#
> :/sys/devices/system/node/node1# cat memtier
> :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
> 2-3
> :/sys/devices/system/node/node1#
> 
> 


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-07 21:32   ` Yang Shi
@ 2022-06-08  1:34     ` Ying Huang
  2022-06-08 16:37       ` Yang Shi
  2022-06-08  4:58     ` Aneesh Kumar K V
  1 sibling, 1 reply; 84+ messages in thread
From: Ying Huang @ 2022-06-08  1:34 UTC (permalink / raw)
  To: Yang Shi, Aneesh Kumar K.V
  Cc: Linux MM, Andrew Morton, Wei Xu, Greg Thelen, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Tue, 2022-06-07 at 14:32 -0700, Yang Shi wrote:
> On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
> > 
> > In the current kernel, memory tiers are defined implicitly via a
> > demotion path relationship between NUMA nodes, which is created
> > during the kernel initialization and updated when a NUMA node is
> > hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and builds the tier hierarchy
> > tier-by-tier by establishing the per-node demotion targets based
> > on the distances between nodes.
> > 
> > This current memory tier kernel interface needs to be improved for
> > several important use cases:
> > 
> > The current tier initialization code always initializes
> > each memory-only NUMA node into a lower tier.  But a memory-only
> > NUMA node may have a high performance memory device (e.g. a DRAM
> > device attached via CXL.mem or a DRAM-backed memory-only node on
> > a virtual machine) and should be put into a higher tier.
> > 
> > The current tier hierarchy always puts CPU nodes into the top
> > tier. But on a system with HBM or GPU devices, the
> > memory-only NUMA nodes mapping these devices should be in the
> > top tier, and DRAM nodes with CPUs are better to be placed into the
> > next lower tier.
> > 
> > With the current kernel, a higher tier node can only be demoted to
> > selected nodes on the next lower tier as defined by the demotion
> > path, not any other node from any lower tier.  This strict,
> > hard-coded demotion order does not work in all use cases (e.g. some
> > use cases may want to allow cross-socket demotion to another node
> > in the same demotion tier as a fallback when the preferred demotion
> > node is out of space).  This demotion order is also inconsistent
> > with the page allocation fallback order when all the nodes in a
> > higher tier are out of space: The page allocation can fall back to
> > any node from any lower tier, whereas the demotion order doesn't
> > allow that.
> > 
> > The current kernel also doesn't provide any interfaces for the
> > userspace to learn about the memory tier hierarchy in order to
> > optimize its memory allocations.
> > 
> > This patch series addresses the above by defining memory tiers explicitly.
> > 
> > This patch introduces explicit memory tiers with ranks. The rank
> > value of a memory tier is used to derive the demotion order between
> > NUMA nodes. The memory tiers present in a system can be found at
> > 
> > /sys/devices/system/memtier/memtierN/
> > 
> > The nodes which are part of a specific memory tier can be listed
> > via
> > /sys/devices/system/memtier/memtierN/nodelist
> > 
> > "Rank" is an opaque value. Its absolute value doesn't have any
> > special meaning. But the rank values of different memtiers can be
> > compared with each other to determine the memory tier order.
> > 
> > For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
> > their rank values are 300, 200, 100, then the memory tier order is:
> > memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
> > and memtier1 is the lowest tier.
> > 
> > The rank value of each memtier should be unique.
> > 
> > A higher rank memory tier appears earlier in the demotion order
> > than a lower rank memory tier, i.e. during reclaim we prefer to
> > demote pages to a node in a higher rank memory tier over a node
> > in a lower rank memory tier.
> > 
> > For now we are not supporting a dynamic number of memory tiers.
> > But a future series adding that is possible. Currently the
> > number of tiers supported is limited to MAX_MEMORY_TIERS(3).
> > When doing memory hotplug, if not added to a memory tier, the NUMA
> > node gets added to DEFAULT_MEMORY_TIER(1).
> > 
> > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
> > 
> > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> > 
> > Suggested-by: Wei Xu <weixugc@google.com>
> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > ---
> >  include/linux/memory-tiers.h |  20 ++++
> >  mm/Kconfig                   |  11 ++
> >  mm/Makefile                  |   1 +
> >  mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
> >  4 files changed, 220 insertions(+)
> >  create mode 100644 include/linux/memory-tiers.h
> >  create mode 100644 mm/memory-tiers.c
> > 
> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > new file mode 100644
> > index 000000000000..e17f6b4ee177
> > --- /dev/null
> > +++ b/include/linux/memory-tiers.h
> > @@ -0,0 +1,20 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_MEMORY_TIERS_H
> > +#define _LINUX_MEMORY_TIERS_H
> > +
> > +#ifdef CONFIG_TIERED_MEMORY
> > +
> > +#define MEMORY_TIER_HBM_GPU    0
> > +#define MEMORY_TIER_DRAM       1
> > +#define MEMORY_TIER_PMEM       2
> > +
> > +#define MEMORY_RANK_HBM_GPU    300
> > +#define MEMORY_RANK_DRAM       200
> > +#define MEMORY_RANK_PMEM       100
> > +
> > +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
> > +#define MAX_MEMORY_TIERS  3
> > +
> > +#endif /* CONFIG_TIERED_MEMORY */
> > +
> > +#endif
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 169e64192e48..08a3d330740b 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> >  config ARCH_ENABLE_THP_MIGRATION
> >         bool
> > 
> > +config TIERED_MEMORY
> > +       bool "Support for explicit memory tiers"
> > +       def_bool n
> > +       depends on MIGRATION && NUMA
> > +       help
> > +         Support to split nodes into memory tiers explicitly and
> > +         to demote pages on reclaim to lower tiers. This option
> > +         also exposes sysfs interface to read nodes available in
> > +         specific tier and to move specific node among different
> > +         possible tiers.
> 
> IMHO we should not need a new kernel config. If tiering is not present
> then there is just one tier on the system. And tiering is a kind of
> hardware configuration; the information could be shown regardless of
> whether demotion/promotion is supported/enabled or not.

I think so too.  At least it appears unnecessary to let the user turn
it on/off at configuration time.

All the code should be enclosed by #if defined(CONFIG_NUMA) &&
defined(CONFIG_MIGRATION).  So we will not waste memory in small
systems.
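
For example, the header could provide no-op stubs for the disabled
case, something like (a sketch; the next_demotion_node() stub is just
illustrative):

	/* include/linux/memory-tiers.h */
	#if defined(CONFIG_NUMA) && defined(CONFIG_MIGRATION)
	int next_demotion_node(int node);
	#else
	static inline int next_demotion_node(int node)
	{
		return NUMA_NO_NODE;
	}
	#endif

That way callers need no #ifdefs and the compiler can drop the dead
code on !NUMA or !MIGRATION builds.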

Best Regards,
Huang, Ying

> > +
> >  config HUGETLB_PAGE_SIZE_VARIABLE
> >         def_bool n
> >         help
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 6f9ffa968a1a..482557fbc9d1 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
> >  obj-$(CONFIG_FAILSLAB) += failslab.o
> >  obj-$(CONFIG_MEMTEST)          += memtest.o
> >  obj-$(CONFIG_MIGRATION) += migrate.o
> > +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > new file mode 100644
> > index 000000000000..7de18d94a08d
> > --- /dev/null
> > +++ b/mm/memory-tiers.c
> > @@ -0,0 +1,188 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/types.h>
> > +#include <linux/device.h>
> > +#include <linux/nodemask.h>
> > +#include <linux/slab.h>
> > +#include <linux/memory-tiers.h>
> > +
> > +struct memory_tier {
> > +       struct list_head list;
> > +       struct device dev;
> > +       nodemask_t nodelist;
> > +       int rank;
> > +};
> > +
> > +#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
> > +
> > +static struct bus_type memory_tier_subsys = {
> > +       .name = "memtier",
> > +       .dev_name = "memtier",
> > +};
> > +
> > +static DEFINE_MUTEX(memory_tier_lock);
> > +static LIST_HEAD(memory_tiers);
> > +
> > +
> > +static ssize_t nodelist_show(struct device *dev,
> > +                            struct device_attribute *attr, char *buf)
> > +{
> > +       struct memory_tier *memtier = to_memory_tier(dev);
> > +
> > +       return sysfs_emit(buf, "%*pbl\n",
> > +                         nodemask_pr_args(&memtier->nodelist));
> > +}
> > +static DEVICE_ATTR_RO(nodelist);
> > +
> > +static ssize_t rank_show(struct device *dev,
> > +                        struct device_attribute *attr, char *buf)
> > +{
> > +       struct memory_tier *memtier = to_memory_tier(dev);
> > +
> > +       return sysfs_emit(buf, "%d\n", memtier->rank);
> > +}
> > +static DEVICE_ATTR_RO(rank);
> > +
> > +static struct attribute *memory_tier_dev_attrs[] = {
> > +       &dev_attr_nodelist.attr,
> > +       &dev_attr_rank.attr,
> > +       NULL
> > +};
> > +
> > +static const struct attribute_group memory_tier_dev_group = {
> > +       .attrs = memory_tier_dev_attrs,
> > +};
> > +
> > +static const struct attribute_group *memory_tier_dev_groups[] = {
> > +       &memory_tier_dev_group,
> > +       NULL
> > +};
> > +
> > +static void memory_tier_device_release(struct device *dev)
> > +{
> > +       struct memory_tier *tier = to_memory_tier(dev);
> > +
> > +       kfree(tier);
> > +}
> > +
> > +/*
> > + * Keep it simple by having  direct mapping between
> > + * tier index and rank value.
> > + */
> > +static inline int get_rank_from_tier(unsigned int tier)
> > +{
> > +       switch (tier) {
> > +       case MEMORY_TIER_HBM_GPU:
> > +               return MEMORY_RANK_HBM_GPU;
> > +       case MEMORY_TIER_DRAM:
> > +               return MEMORY_RANK_DRAM;
> > +       case MEMORY_TIER_PMEM:
> > +               return MEMORY_RANK_PMEM;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static void insert_memory_tier(struct memory_tier *memtier)
> > +{
> > +       struct list_head *ent;
> > +       struct memory_tier *tmp_memtier;
> > +
> > +       list_for_each(ent, &memory_tiers) {
> > +               tmp_memtier = list_entry(ent, struct memory_tier, list);
> > +               if (tmp_memtier->rank < memtier->rank) {
> > +                       list_add_tail(&memtier->list, ent);
> > +                       return;
> > +               }
> > +       }
> > +       list_add_tail(&memtier->list, &memory_tiers);
> > +}
> > +
> > +static struct memory_tier *register_memory_tier(unsigned int tier)
> > +{
> > +       int error;
> > +       struct memory_tier *memtier;
> > +
> > +       if (tier >= MAX_MEMORY_TIERS)
> > +               return NULL;
> > +
> > +       memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> > +       if (!memtier)
> > +               return NULL;
> > +
> > +       memtier->dev.id = tier;
> > +       memtier->rank = get_rank_from_tier(tier);
> > +       memtier->dev.bus = &memory_tier_subsys;
> > +       memtier->dev.release = memory_tier_device_release;
> > +       memtier->dev.groups = memory_tier_dev_groups;
> > +
> > +       insert_memory_tier(memtier);
> > +
> > +       error = device_register(&memtier->dev);
> > +       if (error) {
> > +               list_del(&memtier->list);
> > +               put_device(&memtier->dev);
> > +               return NULL;
> > +       }
> > +       return memtier;
> > +}
> > +
> > > +__maybe_unused // temporary to prevent warnings during bisects
> > +static void unregister_memory_tier(struct memory_tier *memtier)
> > +{
> > +       list_del(&memtier->list);
> > +       device_unregister(&memtier->dev);
> > +}
> > +
> > +static ssize_t
> > +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> > +{
> > +       return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
> > +}
> > +static DEVICE_ATTR_RO(max_tier);
> > +
> > +static ssize_t
> > +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> > +{
> > +       return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER);
> > +}
> > +static DEVICE_ATTR_RO(default_tier);
> > +
> > +static struct attribute *memory_tier_attrs[] = {
> > +       &dev_attr_max_tier.attr,
> > +       &dev_attr_default_tier.attr,
> > +       NULL
> > +};
> > +
> > +static const struct attribute_group memory_tier_attr_group = {
> > +       .attrs = memory_tier_attrs,
> > +};
> > +
> > +static const struct attribute_group *memory_tier_attr_groups[] = {
> > +       &memory_tier_attr_group,
> > +       NULL,
> > +};
> > +
> > +static int __init memory_tier_init(void)
> > +{
> > +       int ret;
> > +       struct memory_tier *memtier;
> > +
> > +       ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> > +       if (ret)
> > +               panic("%s() failed to register subsystem: %d\n", __func__, ret);
> > +
> > +       /*
> > +        * Register only default memory tier to hide all empty
> > +        * memory tier from sysfs.
> > +        */
> > +       memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
> > +       if (!memtier)
> > +               panic("%s() failed to register memory tier: %d\n", __func__, ret);
> > +
> > +       /* CPU only nodes are not part of memory tiers. */
> > +       memtier->nodelist = node_states[N_MEMORY];
> > +
> > +       return 0;
> > +}
> > +subsys_initcall(memory_tier_init);
> > +
> > --
> > 2.36.1
> > 



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-07 18:43   ` Tim Chen
  2022-06-07 20:18     ` Wei Xu
@ 2022-06-08  4:30     ` Aneesh Kumar K V
  2022-06-08  6:06       ` Ying Huang
  2022-06-08  4:37     ` Aneesh Kumar K V
  2 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08  4:30 UTC (permalink / raw)
  To: Tim Chen, linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 12:13 AM, Tim Chen wrote:
> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>>
>>
>> The nodes which are part of a specific memory tier can be listed
>> via
>> /sys/devices/system/memtier/memtierN/nodelist
>>
>> "Rank" is an opaque value. Its absolute value doesn't have any
>> special meaning. But the rank values of different memtiers can be
>> compared with each other to determine the memory tier order.
>>
>> For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
>> their rank values are 300, 200, 100, then the memory tier order is:
>> memtier0 -> memtier2 -> memtier1,
> 
> Why is memtier2 (rank 100) higher than memtier1 (rank 200)?  Seems like
> the order should be memtier0 -> memtier1 -> memtier2?
>                      (rank 300)  (rank 200)  (rank 100)
> 
>> where memtier0 is the highest tier
>> and memtier1 is the lowest tier.
> 
> I think memtier2 is the lowest as it has the lowest rank value.


Typo. Will fix that in the next update.

>>
>> The rank value of each memtier should be unique.
>>
>>
>> +
>> +static void memory_tier_device_release(struct device *dev)
>> +{
>> +	struct memory_tier *tier = to_memory_tier(dev);
>> +
> 
> Do we need some ref counts on memory_tier?
> If there is another device still using the same memtier,
> the free below could cause problems.
> 
>> +	kfree(tier);
>> +}
>> +
>>
> ...
>> +static struct memory_tier *register_memory_tier(unsigned int tier)
>> +{
>> +	int error;
>> +	struct memory_tier *memtier;
>> +
>> +	if (tier >= MAX_MEMORY_TIERS)
>> +		return NULL;
>> +
>> +	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
>> +	if (!memtier)
>> +		return NULL;
>> +
>> +	memtier->dev.id = tier;
>> +	memtier->rank = get_rank_from_tier(tier);
>> +	memtier->dev.bus = &memory_tier_subsys;
>> +	memtier->dev.release = memory_tier_device_release;
>> +	memtier->dev.groups = memory_tier_dev_groups;
>> +
> 
> Should you take the mem_tier_lock before you insert into
> the memtier list?


Both register_memory_tier and unregister_memory_tier get called with 
memory_tier_lock held.

> 
>> +	insert_memory_tier(memtier);
>> +
>> +	error = device_register(&memtier->dev);
>> +	if (error) {
>> +		list_del(&memtier->list);
>> +		put_device(&memtier->dev);
>> +		return NULL;
>> +	}
>> +	return memtier;
>> +}
>> +
>> +__maybe_unused // temporary to prevent warnings during bisects
>> +static void unregister_memory_tier(struct memory_tier *memtier)
>> +{
> 
> I think we should take mem_tier_lock before modifying memtier->list.
> 

unregister_memory_tier() gets called with memory_tier_lock held.

>> +	list_del(&memtier->list);
>> +	device_unregister(&memtier->dev);
>> +}
>> +
>>

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-07 18:43   ` Tim Chen
  2022-06-07 20:18     ` Wei Xu
  2022-06-08  4:30     ` Aneesh Kumar K V
@ 2022-06-08  4:37     ` Aneesh Kumar K V
  2022-06-08  6:10       ` Ying Huang
  2 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08  4:37 UTC (permalink / raw)
  To: Tim Chen, linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 12:13 AM, Tim Chen wrote:
...

>>
>> +
>> +static void memory_tier_device_release(struct device *dev)
>> +{
>> +	struct memory_tier *tier = to_memory_tier(dev);
>> +
> 
> Do we need some ref counts on memory_tier?
> If there is another device still using the same memtier,
> the free below could cause problems.
> 
>> +	kfree(tier);
>> +}
>> +
>>
> ...

The lifecycle of the memory_tier struct is tied to the sysfs device
lifetime, i.e. memory_tier_device_release() gets called only after the
last reference on that sysfs dev object is released. Hence we can be
sure there is no userspace that is keeping one of the memtier related
sysfs files open.

W.r.t. other memory devices sharing the same memtier, we unregister the
sysfs device only when the memory tier nodelist is empty, that is, when
no memory device is present in this memory tier.
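
In driver-core terms, an in-kernel user that wanted to hold on to a
memtier beyond memory_tier_lock could pin the embedded device in the
usual way (a sketch of the standard get_device()/put_device() pattern,
not something this series does today):

	get_device(&memtier->dev);	/* pin: ->release() cannot run */
	/* ... use memtier ... */
	put_device(&memtier->dev);	/* last put calls ->release(), which kfree()s */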

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 2/9] mm/demotion: Expose per node memory tier to sysfs
  2022-06-07 20:15   ` Tim Chen
@ 2022-06-08  4:55     ` Aneesh Kumar K V
  2022-06-08  6:42       ` Ying Huang
  2022-06-08 16:06       ` Tim Chen
  0 siblings, 2 replies; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08  4:55 UTC (permalink / raw)
  To: Tim Chen, linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 1:45 AM, Tim Chen wrote:
> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>>
>>   
>> +static struct memory_tier *__node_get_memory_tier(int node)
>> +{
>> +	struct memory_tier *memtier;
>> +
>> +	list_for_each_entry(memtier, &memory_tiers, list) {
> 
> We may need to map a node to its mem_tier quite often, if we need
> to account memory usage at the tier level.  It will be more efficient
> to have a pointer from the node (pgdat) to the memtier rather
> than doing a search through the list.
> 
> 

That is something I was actively trying to avoid. Currently all struct
memory_tier references are taken with the memory_tier_lock mutex held.
That simplifies the locking and reference counting.

As of now we are able to implement all the required interfaces without
pgdat having pointers to struct memory_tier. We can update pgdat with
memtier details when we are implementing changes requiring those. We
could keep an additional memtier->dev reference to make sure memory
tiers are not destroyed while other parts of the kernel are referencing
them. But IMHO such changes should wait till we have users for the same.
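
If we do end up needing it, the cache could be as simple as a field in
pgdat (a hypothetical sketch, not something this series adds; the
field name is made up):

	/* include/linux/mmzone.h */
	struct pglist_data {
		...
		struct memory_tier *memtier;	/* protected by memory_tier_lock */
		...
	};

and __node_get_memory_tier() would reduce to a pointer read under the
lock.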


>> +		if (node_isset(node, memtier->nodelist))
>> +			return memtier;
>> +	}
>> +	return NULL;
>> +}
>> +
>>
> 
> Tim
>

-aneesh


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-07 21:32   ` Yang Shi
  2022-06-08  1:34     ` Ying Huang
@ 2022-06-08  4:58     ` Aneesh Kumar K V
  2022-06-08  6:18       ` Ying Huang
  2022-06-08 16:42       ` Yang Shi
  1 sibling, 2 replies; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08  4:58 UTC (permalink / raw)
  To: Yang Shi
  Cc: Linux MM, Andrew Morton, Wei Xu, Huang Ying, Greg Thelen,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 3:02 AM, Yang Shi wrote:
> On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> In the current kernel, memory tiers are defined implicitly via a
>> demotion path relationship between NUMA nodes, which is created
>> during the kernel initialization and updated when a NUMA node is
>> hot-added or hot-removed.  The current implementation puts all
>> nodes with CPU into the top tier, and builds the tier hierarchy
>> tier-by-tier by establishing the per-node demotion targets based
>> on the distances between nodes.
>>
>> This current memory tier kernel interface needs to be improved for
>> several important use cases:
>>
>> The current tier initialization code always initializes
>> each memory-only NUMA node into a lower tier.  But a memory-only
>> NUMA node may have a high performance memory device (e.g. a DRAM
>> device attached via CXL.mem or a DRAM-backed memory-only node on
>> a virtual machine) and should be put into a higher tier.
>>
>> The current tier hierarchy always puts CPU nodes into the top
>> tier. But on a system with HBM or GPU devices, the
>> memory-only NUMA nodes mapping these devices should be in the
>> top tier, and DRAM nodes with CPUs are better to be placed into the
>> next lower tier.
>>
>> With the current kernel, a higher tier node can only be demoted to
>> selected nodes on the next lower tier as defined by the demotion
>> path, not any other node from any lower tier.  This strict,
>> hard-coded demotion order does not work in all use cases (e.g. some
>> use cases may want to allow cross-socket demotion to another node
>> in the same demotion tier as a fallback when the preferred demotion
>> node is out of space).  This demotion order is also inconsistent
>> with the page allocation fallback order when all the nodes in a
>> higher tier are out of space: The page allocation can fall back to
>> any node from any lower tier, whereas the demotion order doesn't
>> allow that.
>>
>> The current kernel also doesn't provide any interfaces for the
>> userspace to learn about the memory tier hierarchy in order to
>> optimize its memory allocations.
>>
>> This patch series addresses the above by defining memory tiers explicitly.
>>
>> This patch introduces explicit memory tiers with ranks. The rank
>> value of a memory tier is used to derive the demotion order between
>> NUMA nodes. The memory tiers present in a system can be found at
>>
>> /sys/devices/system/memtier/memtierN/
>>
>> The nodes which are part of a specific memory tier can be listed
>> via
>> /sys/devices/system/memtier/memtierN/nodelist
>>
>> "Rank" is an opaque value. Its absolute value doesn't have any
>> special meaning. But the rank values of different memtiers can be
>> compared with each other to determine the memory tier order.
>>
>> For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
>> their rank values are 300, 200, 100, then the memory tier order is:
>> memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
>> and memtier1 is the lowest tier.
>>
>> The rank value of each memtier should be unique.
>>
>> A higher rank memory tier appears earlier in the demotion order
>> than a lower rank memory tier, i.e. during reclaim we prefer to
>> demote pages to a node in a higher rank memory tier over a node
>> in a lower rank memory tier.
>>
>> For now we are not supporting a dynamic number of memory tiers.
>> But a future series adding that is possible. Currently the
>> number of tiers supported is limited to MAX_MEMORY_TIERS(3).
>> When doing memory hotplug, if not added to a memory tier, the NUMA
>> node gets added to DEFAULT_MEMORY_TIER(1).
>>
>> This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
>>
>> [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>>
>> Suggested-by: Wei Xu <weixugc@google.com>
>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   include/linux/memory-tiers.h |  20 ++++
>>   mm/Kconfig                   |  11 ++
>>   mm/Makefile                  |   1 +
>>   mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
>>   4 files changed, 220 insertions(+)
>>   create mode 100644 include/linux/memory-tiers.h
>>   create mode 100644 mm/memory-tiers.c
>>
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> new file mode 100644
>> index 000000000000..e17f6b4ee177
>> --- /dev/null
>> +++ b/include/linux/memory-tiers.h
>> @@ -0,0 +1,20 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_MEMORY_TIERS_H
>> +#define _LINUX_MEMORY_TIERS_H
>> +
>> +#ifdef CONFIG_TIERED_MEMORY
>> +
>> +#define MEMORY_TIER_HBM_GPU    0
>> +#define MEMORY_TIER_DRAM       1
>> +#define MEMORY_TIER_PMEM       2
>> +
>> +#define MEMORY_RANK_HBM_GPU    300
>> +#define MEMORY_RANK_DRAM       200
>> +#define MEMORY_RANK_PMEM       100
>> +
>> +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
>> +#define MAX_MEMORY_TIERS  3
>> +
>> +#endif /* CONFIG_TIERED_MEMORY */
>> +
>> +#endif
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 169e64192e48..08a3d330740b 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
>>   config ARCH_ENABLE_THP_MIGRATION
>>          bool
>>
>> +config TIERED_MEMORY
>> +       bool "Support for explicit memory tiers"
>> +       def_bool n
>> +       depends on MIGRATION && NUMA
>> +       help
>> +         Support to split nodes into memory tiers explicitly and
>> +         to demote pages on reclaim to lower tiers. This option
>> +         also exposes sysfs interface to read nodes available in
>> +         specific tier and to move specific node among different
>> +         possible tiers.
> 
> IMHO we should not need a new kernel config. If tiering is not present
> then there is just one tier on the system. And tiering is a kind of
> hardware configuration; the information could be shown regardless of
> whether demotion/promotion is supported/enabled or not.
> 

This was added so that we could avoid doing multiple

#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)

Initially I had that as def_bool y and depends on MIGRATION && NUMA. But 
it was later suggested that def_bool is not recommended for newer configs.

How about

  config TIERED_MEMORY
  	bool "Support for explicit memory tiers"
-	def_bool n
-	depends on MIGRATION && NUMA
-	help
-	  Support to split nodes into memory tiers explicitly and
-	  to demote pages on reclaim to lower tiers. This option
-	  also exposes sysfs interface to read nodes available in
-	  specific tier and to move specific node among different
-	  possible tiers.
+	def_bool MIGRATION && NUMA

  config HUGETLB_PAGE_SIZE_VARIABLE
  	def_bool n

i.e., we just make it a Kconfig variable without exposing it to the user?

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-06-07 22:51   ` Tim Chen
@ 2022-06-08  5:02     ` Aneesh Kumar K V
  2022-06-08  6:52     ` Ying Huang
  1 sibling, 0 replies; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08  5:02 UTC (permalink / raw)
  To: Tim Chen, linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 4:21 AM, Tim Chen wrote:
> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>>
>> +int next_demotion_node(int node)
>> +{
>> +	struct demotion_nodes *nd;
>> +	int target, nnodes, i;
>> +
>> +	if (!node_demotion)
>> +		return NUMA_NO_NODE;
>> +
>> +	nd = &node_demotion[node];
>> +
>> +	/*
>> +	 * node_demotion[] is updated without excluding this
>> +	 * function from running.
>> +	 *
>> +	 * Make sure to use RCU over entire code blocks if
>> +	 * node_demotion[] reads need to be consistent.
>> +	 */
>> +	rcu_read_lock();
>> +
>> +	nnodes = nodes_weight(nd->preferred);
>> +	if (!nnodes)
>> +		return NUMA_NO_NODE;
>> +
>> +	/*
>> +	 * If there are multiple target nodes, just select one
>> +	 * target node randomly.
>> +	 *
>> +	 * In addition, we can also use round-robin to select
>> +	 * target node, but we should introduce another variable
>> +	 * for node_demotion[] to record last selected target node,
>> +	 * that may cause cache ping-pong due to the changing of
>> +	 * last target node. Or introducing per-cpu data to avoid
>> +	 * caching issue, which seems more complicated. So selecting
>> +	 * target node randomly seems better until now.
>> +	 */
>> +	nnodes = get_random_int() % nnodes;
>> +	target = first_node(nd->preferred);
>> +	for (i = 0; i < nnodes; i++)
>> +		target = next_node(target, nd->preferred);
> 
> We can simplify the above 4 lines.
> 
> 	target = node_random(nd->preferred);
> 
> There's still a loop overhead though :(
> 

Will fix in the next update.

>> +
>> +	rcu_read_unlock();
>> +
>> +	return target;
>> +}
>> +
>>

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-08  4:30     ` Aneesh Kumar K V
@ 2022-06-08  6:06       ` Ying Huang
  0 siblings, 0 replies; 84+ messages in thread
From: Ying Huang @ 2022-06-08  6:06 UTC (permalink / raw)
  To: Aneesh Kumar K V, Tim Chen, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Wed, 2022-06-08 at 10:00 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 12:13 AM, Tim Chen wrote:
> > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> > > 
> > > 
> > > The nodes which are part of a specific memory tier can be listed
> > > via
> > > /sys/devices/system/memtier/memtierN/nodelist
> > > 
> > > "Rank" is an opaque value. Its absolute value doesn't have any
> > > special meaning. But the rank values of different memtiers can be
> > > compared with each other to determine the memory tier order.
> > > 
> > > For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
> > > their rank values are 300, 200, 100, then the memory tier order is:
> > > memtier0 -> memtier2 -> memtier1,
> > 
> > Why is memtier2 (rank 100) higher than memtier1 (rank 200)?  Seems like
> > the order should be memtier0 -> memtier1 -> memtier2?
> >                      (rank 300)  (rank 200)  (rank 100)
> > 
> > > where memtier0 is the highest tier
> > > and memtier1 is the lowest tier.
> > 
> > I think memtier2 is the lowest as it has the lowest rank value.
> 
> 
> Typo. Will fix that in the next update.
> 
> > > 
> > > The rank value of each memtier should be unique.
> > > 
> > > 
> > > +
> > > +static void memory_tier_device_release(struct device *dev)
> > > +{
> > > +	struct memory_tier *tier = to_memory_tier(dev);
> > > +
> > 
> > Do we need some ref counts on memory_tier?
> > If there is another device still using the same memtier,
> > the free below could cause problems.
> > 
> > > +	kfree(tier);
> > > +}
> > > +
> > > 
> > ...
> > > +static struct memory_tier *register_memory_tier(unsigned int tier)
> > > +{
> > > +	int error;
> > > +	struct memory_tier *memtier;
> > > +
> > > +	if (tier >= MAX_MEMORY_TIERS)
> > > +		return NULL;
> > > +
> > > +	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> > > +	if (!memtier)
> > > +		return NULL;
> > > +
> > > +	memtier->dev.id = tier;
> > > +	memtier->rank = get_rank_from_tier(tier);
> > > +	memtier->dev.bus = &memory_tier_subsys;
> > > +	memtier->dev.release = memory_tier_device_release;
> > > +	memtier->dev.groups = memory_tier_dev_groups;
> > > +
> > 
> > Should you take the mem_tier_lock before you insert into
> > the memtier list?
> 
> 
> Both register_memory_tier and unregister_memory_tier get called with 
> memory_tier_lock held.

Then please add locking requirements to the comments above these
functions.
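
Something like the below, for example (a sketch):

	/*
	 * Must be called with memory_tier_lock held.
	 */
	static struct memory_tier *register_memory_tier(unsigned int tier)
	{
		lockdep_assert_held(&memory_tier_lock);
		/* ... body as before ... */
	}

lockdep_assert_held() additionally turns the documented requirement
into a runtime check on lockdep-enabled builds.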

Best Regards,
Huang, Ying

> > 
> > > +	insert_memory_tier(memtier);
> > > +
> > > +	error = device_register(&memtier->dev);
> > > +	if (error) {
> > > +		list_del(&memtier->list);
> > > +		put_device(&memtier->dev);
> > > +		return NULL;
> > > +	}
> > > +	return memtier;
> > > +}
> > > +
> > > +__maybe_unused // temporary to prevent warnings during bisects
> > > +static void unregister_memory_tier(struct memory_tier *memtier)
> > > +{
> > 
> > I think we should take mem_tier_lock before modifying memtier->list.
> > 
> 
> unregister_memory_tier() gets called with memory_tier_lock held.
> 
> > > +	list_del(&memtier->list);
> > > +	device_unregister(&memtier->dev);
> > > +}
> > > +
> > > 
> 
> -aneesh



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-08  4:37     ` Aneesh Kumar K V
@ 2022-06-08  6:10       ` Ying Huang
  2022-06-08  8:04         ` Aneesh Kumar K V
  0 siblings, 1 reply; 84+ messages in thread
From: Ying Huang @ 2022-06-08  6:10 UTC (permalink / raw)
  To: Aneesh Kumar K V, Tim Chen, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Wed, 2022-06-08 at 10:07 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 12:13 AM, Tim Chen wrote:
> ...
> 
> > > 
> > > +
> > > +static void memory_tier_device_release(struct device *dev)
> > > +{
> > > +	struct memory_tier *tier = to_memory_tier(dev);
> > > +
> > 
> > Do we need some ref counts on memory_tier?
> > If there is another device still using the same memtier,
> > free below could cause problem.
> > 
> > > +	kfree(tier);
> > > +}
> > > +
> > > 
> > ...
> 
> The lifecycle of the memory_tier struct is tied to the sysfs device
> lifetime, i.e. memory_tier_device_release() gets called only after the
> last reference on that sysfs dev object is released. Hence we can be
> sure there is no userspace that is keeping one of the memtier related
> sysfs files open.
> 
> W.r.t. other memory devices sharing the same memtier, we unregister the
> sysfs device only when the memory tier nodelist is empty, that is, when
> no memory device is present in this memory tier.

memory_tier isn't only used by user space.  It is used inside the
kernel too.  If some kernel code gets a pointer to struct memory_tier,
we need to guarantee the pointer will not be freed under us.  And as
Tim pointed out, we need to use it in hot paths (for statistics), so
some kind of RCU lock may be good.
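
The lookup side could then look something like this (a sketch,
assuming memtier->list is updated with the list_*_rcu() variants
under memory_tier_lock):

	rcu_read_lock();
	list_for_each_entry_rcu(memtier, &memory_tiers, list) {
		if (node_isset(node, memtier->nodelist)) {
			/* use memtier, or take a reference on
			 * memtier->dev, before unlocking */
			break;
		}
	}
	rcu_read_unlock();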

Best Regards,
Huang, Ying




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-08  4:58     ` Aneesh Kumar K V
@ 2022-06-08  6:18       ` Ying Huang
  2022-06-08 16:42       ` Yang Shi
  1 sibling, 0 replies; 84+ messages in thread
From: Ying Huang @ 2022-06-08  6:18 UTC (permalink / raw)
  To: Aneesh Kumar K V, Yang Shi
  Cc: Linux MM, Andrew Morton, Wei Xu, Greg Thelen, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Wed, 2022-06-08 at 10:28 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 3:02 AM, Yang Shi wrote:
> > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> > > 
> > > In the current kernel, memory tiers are defined implicitly via a
> > > demotion path relationship between NUMA nodes, which is created
> > > during the kernel initialization and updated when a NUMA node is
> > > hot-added or hot-removed.  The current implementation puts all
> > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based
> > > on the distances between nodes.
> > > 
> > > This current memory tier kernel interface needs to be improved for
> > > several important use cases:
> > > 
> > > The current tier initialization code always initializes
> > > each memory-only NUMA node into a lower tier.  But a memory-only
> > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > a virtual machine) and should be put into a higher tier.
> > > 
> > > The current tier hierarchy always puts CPU nodes into the top
> > > tier. But on a system with HBM or GPU devices, the
> > > memory-only NUMA nodes mapping these devices should be in the
> > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > next lower tier.
> > > 
> > > With the current kernel, a higher tier node can only be demoted to
> > > selected nodes on the next lower tier as defined by the demotion
> > > path, not any other node from any lower tier.  This strict,
> > > hard-coded demotion order does not work in all use cases (e.g. some
> > > use cases may want to allow cross-socket demotion to another node
> > > in the same demotion tier as a fallback when the preferred demotion
> > > node is out of space).  This demotion order is also inconsistent
> > > with the page allocation fallback order when all the nodes in a
> > > higher tier are out of space: The page allocation can fall back to
> > > any node from any lower tier, whereas the demotion order doesn't
> > > allow that.
> > > 
> > > The current kernel also doesn't provide any interfaces for the
> > > userspace to learn about the memory tier hierarchy in order to
> > > optimize its memory allocations.
> > > 
> > > This patch series addresses the above by defining memory tiers explicitly.
> > > 
> > > This patch introduces explicit memory tiers with ranks. The rank
> > > value of a memory tier is used to derive the demotion order between
> > > NUMA nodes. The memory tiers present in a system can be found at
> > > 
> > > /sys/devices/system/memtier/memtierN/
> > > 
> > > The nodes which are part of a specific memory tier can be listed
> > > via
> > > /sys/devices/system/memtier/memtierN/nodelist
> > > 
> > > "Rank" is an opaque value. Its absolute value doesn't have any
> > > special meaning. But the rank values of different memtiers can be
> > > compared with each other to determine the memory tier order.
> > > 
> > > For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
> > > their rank values are 300, 200, 100, then the memory tier order is:
> > > memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
> > > and memtier1 is the lowest tier.
> > > 
> > > The rank value of each memtier should be unique.
> > > 
> > > A higher rank memory tier will appear earlier in the demotion order
> > > than a lower rank memory tier, i.e. during reclaim we prefer a node
> > > in a higher rank memory tier as the demotion target over a node in
> > > a lower rank memory tier.
> > > 
> > > For now we are not adding support for a dynamic number of memory
> > > tiers, but a future series supporting that is possible. Currently the
> > > number of tiers supported is limited to MAX_MEMORY_TIERS(3).
> > > When doing memory hotplug, if a NUMA node is not already part of a
> > > memory tier, it gets added to DEFAULT_MEMORY_TIER(1).
> > > 
> > > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
> > > 
> > > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> > > 
> > > Suggested-by: Wei Xu <weixugc@google.com>
> > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > ---
> > >   include/linux/memory-tiers.h |  20 ++++
> > >   mm/Kconfig                   |  11 ++
> > >   mm/Makefile                  |   1 +
> > >   mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
> > >   4 files changed, 220 insertions(+)
> > >   create mode 100644 include/linux/memory-tiers.h
> > >   create mode 100644 mm/memory-tiers.c
> > > 
> > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > > new file mode 100644
> > > index 000000000000..e17f6b4ee177
> > > --- /dev/null
> > > +++ b/include/linux/memory-tiers.h
> > > @@ -0,0 +1,20 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > +#define _LINUX_MEMORY_TIERS_H
> > > +
> > > +#ifdef CONFIG_TIERED_MEMORY
> > > +
> > > +#define MEMORY_TIER_HBM_GPU    0
> > > +#define MEMORY_TIER_DRAM       1
> > > +#define MEMORY_TIER_PMEM       2
> > > +
> > > +#define MEMORY_RANK_HBM_GPU    300
> > > +#define MEMORY_RANK_DRAM       200
> > > +#define MEMORY_RANK_PMEM       100
> > > +
> > > +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
> > > +#define MAX_MEMORY_TIERS  3
> > > +
> > > +#endif /* CONFIG_TIERED_MEMORY */
> > > +
> > > +#endif
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 169e64192e48..08a3d330740b 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> > >   config ARCH_ENABLE_THP_MIGRATION
> > >          bool
> > > 
> > > +config TIERED_MEMORY
> > > +       bool "Support for explicit memory tiers"
> > > +       def_bool n
> > > +       depends on MIGRATION && NUMA
> > > +       help
> > > +         Support to split nodes into memory tiers explicitly and
> > > +         to demote pages on reclaim to lower tiers. This option
> > > +         also exposes sysfs interface to read nodes available in
> > > +         specific tier and to move specific node among different
> > > +         possible tiers.
> > 
> > IMHO we should not need a new kernel config. If tiering is not present
> > then there is just one tier on the system. And tiering is a kind of
> > hardware configuration; the information could be shown regardless of
> > whether demotion/promotion is supported/enabled or not.
> > 
> 
> This was added so that we could avoid doing multiple
> 
> #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
> 
> Initially I had that as def_bool y with depends on MIGRATION && NUMA, but
> it was later suggested that def_bool is not recommended for newer configs.
> 
> How about
> 
>   config TIERED_MEMORY
>   	bool "Support for explicit memory tiers"

We need to remove this line too, to make it invisible to users?
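
i.e. the end result would be just something like this (a sketch):

  config TIERED_MEMORY
  	def_bool MIGRATION && NUMA

so that the option is selected automatically and never shown to users.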

Best Regards,
Huang, Ying

> -	def_bool n
> -	depends on MIGRATION && NUMA
> -	help
> -	  Support to split nodes into memory tiers explicitly and
> -	  to demote pages on reclaim to lower tiers. This option
> -	  also exposes sysfs interface to read nodes available in
> -	  specific tier and to move specific node among different
> -	  possible tiers.
> +	def_bool MIGRATION && NUMA
> 
>   config HUGETLB_PAGE_SIZE_VARIABLE
>   	def_bool n
> 
> i.e., we just make it a Kconfig variable without exposing it to the user?
> 
> -aneesh



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 2/9] mm/demotion: Expose per node memory tier to sysfs
  2022-06-08  4:55     ` Aneesh Kumar K V
@ 2022-06-08  6:42       ` Ying Huang
  2022-06-08 16:06       ` Tim Chen
  1 sibling, 0 replies; 84+ messages in thread
From: Ying Huang @ 2022-06-08  6:42 UTC (permalink / raw)
  To: Aneesh Kumar K V, Tim Chen, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Wed, 2022-06-08 at 10:25 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 1:45 AM, Tim Chen wrote:
> > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> > >
> > > +static struct memory_tier *__node_get_memory_tier(int node)
> > > +{
> > > +	struct memory_tier *memtier;
> > > +
> > > +	list_for_each_entry(memtier, &memory_tiers, list) {
> > 
> > We may need to map a node to its mem_tier quite often, if we need
> > to account memory usage at the tier level.  It will be more efficient
> > to have a pointer from the node (pgdat) to the memtier rather
> > than doing a search through the list.
> > 
> > 
> 
> That is something I was actively trying to avoid. Currently all struct
> memory_tier references are made with the memory_tier_lock mutex held. That
> simplifies the locking and reference counting.
>
> As of now we are able to implement all the required interfaces without
> pgdat having pointers to struct memory_tier. We can update pgdat with
> memtier details when we are implementing changes requiring those. We
> could keep an additional memtier->dev reference to make sure memory tiers
> are not destroyed while other parts of the kernel are referencing the
> same. But IMHO such changes should wait till we have users for them.

No.  We need a convenient way to access memory tier information from
inside the kernel.  For example, mapping a nid to its memory tier rank
is needed by migrate_misplaced_page() for statistics; we also need to
iterate over all nodes of a memory tier, etc.

And the "allowed" field of struct demotion_nodes (introduced in [7/9]) is
per-memory-tier instead of per-node.  Please move it to struct
memory_tier.  And we just need a convenient way to access it.

All these are not complex, unless you insist on using memory_tier_lock and
device lifecycle to manage this in-kernel data structure.
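
For example, caching the tier in pg_data_t would make that lookup O(1).
A rough sketch only (the field and helper below are hypothetical, not
part of this series):

typedef struct pglist_data {
	...
	/* Protected by memory_tier_lock; NULL if not in any tier. */
	struct memory_tier *memtier;
	...
} pg_data_t;

static inline int node_memtier_rank(int nid)
{
	struct memory_tier *memtier = NODE_DATA(nid)->memtier;

	return memtier ? memtier->rank : -1;
}

Then migrate_misplaced_page() could just compare node_memtier_rank() of
the source and target nodes to tell whether a migration is a promotion.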

Best Regards,
Huang, Ying

> > > +		if (node_isset(node, memtier->nodelist))
> > > +			return memtier;
> > > +	}
> > > +	return NULL;
> > > +}
> > > +
> > > 
> > 
> > Tim
> > 
> 
> -aneesh
> 



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-06-03 13:42 ` [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
  2022-06-07 22:51   ` Tim Chen
@ 2022-06-08  6:50   ` Ying Huang
  2022-06-08  8:19     ` Aneesh Kumar K V
  2022-06-08  8:00   ` Ying Huang
  2 siblings, 1 reply; 84+ messages in thread
From: Ying Huang @ 2022-06-08  6:50 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> This patch switches the demotion target building logic to use memory tiers
> instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
> default tier 1 and additional memory tiers will be added by drivers like
> dax kmem.
> 
> This patch builds the demotion target for a NUMA node by looking at all
> memory tiers below the tier to which the NUMA node belongs. The closest node
> in the immediately following memory tier is used as a demotion target.
> 
> Since we are now only building demotion targets for N_MEMORY NUMA nodes,
> the CPU hotplug calls are removed in this patch.
> 
> The rank approach allows us to keep memory tier device IDs stable even if there
> is a need to change the tier ordering among different memory tiers. e.g. DRAM
> nodes with CPUs will always be on memtier1, no matter how many tiers are higher
> or lower than these nodes. A new memory tier can be inserted into the tier
> hierarchy for a new set of nodes without affecting the node assignment of any
> existing memtier, provided that there is enough gap in the rank values for the
> new memtier.
> 
> The absolute value of "rank" of a memtier doesn't necessarily carry any meaning.
> Its value relative to other memtiers decides the level of this memtier in the tier
> hierarchy.
> 
> For now, this patch supports hardcoded rank values, which are 300, 200 and
> 100 for memory tiers 0, 1 and 2 respectively.
> 
> Below is the sysfs interface to read the rank values of memory tier,
> /sys/devices/system/memtier/memtierN/rank
> 
> This interface is read-only for now. Write support can be added when there
> is a need for more memory tiers (> 3) with a flexible ordering requirement
> among them.
> 
> Suggested-by: Wei Xu <weixugc@google.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h |   5 +
>  include/linux/migrate.h      |  13 --
>  mm/memory-tiers.c            | 269 ++++++++++++++++++++++++
>  mm/migrate.c                 | 394 -----------------------------------
>  mm/vmstat.c                  |   4 -
>  5 files changed, 274 insertions(+), 411 deletions(-)

It appears that you moved some code from migrate.c to memory-tiers.c and
changed it.  If so, please separate the changes.  That is, one patch
only moves the code, and the other changes the code.  This will make it
easier to find out what is changed.

> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 33ef36395a20..adc2cb3bf5f8 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -16,11 +16,16 @@
>  #define MAX_MEMORY_TIERS  3
>
>  extern bool numa_demotion_enabled;
> +int next_demotion_node(int node);
>  int node_get_memory_tier_id(int node);
>  int node_set_memory_tier(int node, int tier);
>  int node_reset_memory_tier(int node, int tier);
>  #else
>  #define numa_demotion_enabled	false
> +static inline int next_demotion_node(int node)
> +{
> +	return NUMA_NO_NODE;
> +}
>
>  #endif	/* CONFIG_TIERED_MEMORY */
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 43e737215f33..93fab62e6548 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
>
>  #endif /* CONFIG_MIGRATION */
>
> -#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
> -extern void set_migration_target_nodes(void);
> -extern void migrate_on_reclaim_init(void);
> -extern int next_demotion_node(int node);
> -#else
> -static inline void set_migration_target_nodes(void) {}
> -static inline void migrate_on_reclaim_init(void) {}
> -static inline int next_demotion_node(int node)
> -{
> -        return NUMA_NO_NODE;
> -}
> -#endif
> -
>  #ifdef CONFIG_COMPACTION
>  extern int PageMovable(struct page *page);
>  extern void __SetPageMovable(struct page *page, struct address_space *mapping);
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 3f382d1f844a..0d05c0bfb79b 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -4,6 +4,10 @@
>  #include <linux/nodemask.h>
>  #include <linux/slab.h>
>  #include <linux/memory-tiers.h>
> +#include <linux/random.h>
> +#include <linux/memory.h>
> +
> +#include "internal.h"
>
>  struct memory_tier {
>  	struct list_head list;
> @@ -12,6 +16,10 @@ struct memory_tier {
>  	int rank;
>  };
>
> +struct demotion_nodes {
> +	nodemask_t preferred;
> +};
> +
>  #define to_memory_tier(device) container_of(device, struct memory_tier, dev)
>
>  static struct bus_type memory_tier_subsys = {
> @@ -19,9 +27,71 @@ static struct bus_type memory_tier_subsys = {
>  	.dev_name = "memtier",
>  };
>
> +static void establish_migration_targets(void);
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
>
> +/*
> + * node_demotion[] examples:
> + *
> + * Example 1:
> + *
> + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
> + *
> + * node distances:
> + * node   0    1    2    3
> + *    0  10   20   30   40
> + *    1  20   10   40   30
> + *    2  30   40   10   40
> + *    3  40   30   40   10
> + *
> + * memory_tiers[0] = <empty>
> + * memory_tiers[1] = 0-1
> + * memory_tiers[2] = 2-3
> + *
> + * node_demotion[0].preferred = 2
> + * node_demotion[1].preferred = 3
> + * node_demotion[2].preferred = <empty>
> + * node_demotion[3].preferred = <empty>
> + *
> + * Example 2:
> + *
> + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
> + *
> + * node distances:
> + * node   0    1    2
> + *    0  10   20   30
> + *    1  20   10   30
> + *    2  30   30   10
> + *
> + * memory_tiers[0] = <empty>
> + * memory_tiers[1] = 0-2
> + * memory_tiers[2] = <empty>
> + *
> + * node_demotion[0].preferred = <empty>
> + * node_demotion[1].preferred = <empty>
> + * node_demotion[2].preferred = <empty>
> + *
> + * Example 3:
> + *
> + * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
> + *
> + * node distances:
> + * node   0    1    2
> + *    0  10   20   30
> + *    1  20   10   40
> + *    2  30   40   10
> + *
> + * memory_tiers[0] = 1
> + * memory_tiers[1] = 0
> + * memory_tiers[2] = 2
> + *
> + * node_demotion[0].preferred = 2
> + * node_demotion[1].preferred = 0
> + * node_demotion[2].preferred = <empty>
> + *
> + */
> +static struct demotion_nodes *node_demotion __read_mostly;
>
>  static ssize_t nodelist_show(struct device *dev,
>  			     struct device_attribute *attr, char *buf)
> @@ -202,6 +272,7 @@ static void node_remove_from_memory_tier(int node)
>  	if (nodes_empty(memtier->nodelist))
>  		unregister_memory_tier(memtier);
>
> +	establish_migration_targets();
>  out:
>  	mutex_unlock(&memory_tier_lock);
>  }
> @@ -263,6 +334,8 @@ int node_reset_memory_tier(int node, int tier)
>
>  	if (nodes_empty(current_tier->nodelist))
>  		unregister_memory_tier(current_tier);
> +
> +	establish_migration_targets();
>  out:
>  	mutex_unlock(&memory_tier_lock);
>
> @@ -276,13 +349,208 @@ int node_set_memory_tier(int node, int tier)
>
>  	mutex_lock(&memory_tier_lock);
>  	memtier = __node_get_memory_tier(node);
> +	/*
> +	 * if node is already part of the tier proceed with the
> +	 * current tier value, because we might want to establish
> +	 * new migration paths now. The node might be added to a tier
> +	 * before it was made part of N_MEMORY, hence establish_migration_targets
> +	 * will have skipped this node.
> +	 */
>  	if (!memtier)
>  		ret = __node_set_memory_tier(node, tier);
> +	establish_migration_targets();
> +
>  	mutex_unlock(&memory_tier_lock);
>
>  	return ret;
>  }
>
> +/**
> + * next_demotion_node() - Get the next node in the demotion path
> + * @node: The starting node to lookup the next node
> + *
> + * Return: node id for next memory node in the demotion path hierarchy
> + * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
> + * @node online or guarantee that it *continues* to be the next demotion
> + * target.
> + */
> +int next_demotion_node(int node)
> +{
> +	struct demotion_nodes *nd;
> +	int target, nnodes, i;
> +
> +	if (!node_demotion)
> +		return NUMA_NO_NODE;
> +
> +	nd = &node_demotion[node];
> +
> +	/*
> +	 * node_demotion[] is updated without excluding this
> +	 * function from running.
> +	 *
> +	 * Make sure to use RCU over entire code blocks if
> +	 * node_demotion[] reads need to be consistent.
> +	 */
> +	rcu_read_lock();
> +
> +	nnodes = nodes_weight(nd->preferred);
> +	if (!nnodes)
> +		return NUMA_NO_NODE;
> +
> +	/*
> +	 * If there are multiple target nodes, just select one
> +	 * target node randomly.
> +	 *
> +	 * In addition, we can also use round-robin to select
> +	 * target node, but we should introduce another variable
> +	 * for node_demotion[] to record last selected target node,
> +	 * that may cause cache ping-pong due to the changing of
> +	 * last target node. Or introducing per-cpu data to avoid
> +	 * caching issue, which seems more complicated. So selecting
> +	 * target node randomly seems better until now.
> +	 */
> +	nnodes = get_random_int() % nnodes;
> +	target = first_node(nd->preferred);
> +	for (i = 0; i < nnodes; i++)
> +		target = next_node(target, nd->preferred);
> +
> +	rcu_read_unlock();
> +
> +	return target;
> +}
> +
> +/* Disable reclaim-based migration. */
> +static void __disable_all_migrate_targets(void)
> +{
> +	int node;
> +
> +	for_each_node_mask(node, node_states[N_MEMORY])
> +		node_demotion[node].preferred = NODE_MASK_NONE;
> +}
> +
> +static void disable_all_migrate_targets(void)
> +{
> +	__disable_all_migrate_targets();
> +
> +	/*
> +	 * Ensure that the "disable" is visible across the system.
> +	 * Readers will see either a combination of before+disable
> +	 * state or disable+after.  They will never see before and
> +	 * after state together.
> +	 */
> +	synchronize_rcu();
> +}
> +
> +/*
> + * Find an automatic demotion target for all memory
> + * nodes. Failing here is OK.  It might just indicate
> + * being at the end of a chain.
> + */
> +static void establish_migration_targets(void)
> +{
> +	struct memory_tier *memtier;
> +	struct demotion_nodes *nd;
> +	int target = NUMA_NO_NODE, node;
> +	int distance, best_distance;
> +	nodemask_t used;
> +
> +	if (!node_demotion)
> +		return;
> +
> +	disable_all_migrate_targets();
> +
> +	for_each_node_mask(node, node_states[N_MEMORY]) {
> +		best_distance = -1;
> +		nd = &node_demotion[node];
> +
> +		memtier = __node_get_memory_tier(node);
> +		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
> +			continue;
> +		/*
> +		 * Get the next memtier to find the  demotion node list.
> +		 */
> +		memtier = list_next_entry(memtier, list);
> +
> +		/*
> +		 * find_next_best_node, use 'used' nodemask as a skip list.
> +		 * Add all memory nodes except the selected memory tier
> +		 * nodelist to skip list so that we find the best node from the
> +		 * memtier nodelist.
> +		 */
> +		nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);
> +
> +		/*
> +		 * Find all the nodes in the memory tier node list of same best distance.
> +		 * add them to the preferred mask. We randomly select between nodes
> +		 * in the preferred mask when allocating pages during demotion.
> +		 */
> +		do {
> +			target = find_next_best_node(node, &used);
> +			if (target == NUMA_NO_NODE)
> +				break;
> +
> +			distance = node_distance(node, target);
> +			if (distance == best_distance || best_distance == -1) {
> +				best_distance = distance;
> +				node_set(target, nd->preferred);
> +			} else {
> +				break;
> +			}
> +		} while (1);
> +	}
> +}
> +
> +/*
> + * This runs whether reclaim-based migration is enabled or not,
> + * which ensures that the user can turn reclaim-based migration
> + * at any time without needing to recalculate migration targets.
> + */
> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> +						 unsigned long action, void *_arg)
> +{
> +	struct memory_notify *arg = _arg;
> +
> +	/*
> +	 * Only update the node migration order when a node is
> +	 * changing status, like online->offline.
> +	 */
> +	if (arg->status_change_nid < 0)
> +		return notifier_from_errno(0);
> +
> +	switch (action) {
> +	case MEM_OFFLINE:
> +		/*
> +		 * In case we are moving out of N_MEMORY. Keep the node
> +		 * in the memory tier so that when we bring memory online,
> +		 * they appear in the right memory tier. We still need
> +		 * to rebuild the demotion order.
> +		 */
> +		mutex_lock(&memory_tier_lock);
> +		establish_migration_targets();
> +		mutex_unlock(&memory_tier_lock);
> +		break;
> +	case MEM_ONLINE:
> +		/*
> +		 * We ignore the error here, if the node already have the tier
> +		 * registered, we will continue to use that for the new memory
> +		 * we are adding here.
> +		 */
> +		node_set_memory_tier(arg->status_change_nid, DEFAULT_MEMORY_TIER);
> +		break;
> +	}
> +
> +	return notifier_from_errno(0);
> +}
> +
> +static void __init migrate_on_reclaim_init(void)
> +{
> +	node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
> +				GFP_KERNEL);

In the original code, this is

-	node_demotion = kcalloc(nr_node_ids,
-				sizeof(struct demotion_nodes),
-				GFP_KERNEL);

Why change nr_node_ids to MAX_NUMNODES?  If we need to use MAX_NUMNODES,
we can define node_demotion statically.

If you separate this patch into "move" and "change" parts, this kind of
change will be easier to find.
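
For example (sketch):

	static struct demotion_nodes node_demotion[MAX_NUMNODES] __read_mostly;

which would avoid both the runtime allocation and the WARN_ON() on
allocation failure entirely.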

Best Regards,
Huang, Ying

> +	WARN_ON(!node_demotion);
> +
> +	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
> +}
> +
>  static int __init memory_tier_init(void)
>  {
>  	int ret;
> @@ -302,6 +570,7 @@ static int __init memory_tier_init(void)
>
>  	/* CPU only nodes are not part of memory tiers. */
>  	memtier->nodelist = node_states[N_MEMORY];
> +	migrate_on_reclaim_init();
>
>  	return 0;
>  }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 29cacc217e38..0b554625a219 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2116,398 +2116,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>  	return 0;
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
> -
> -/*
> - * node_demotion[] example:
> - *
> - * Consider a system with two sockets.  Each socket has
> - * three classes of memory attached: fast, medium and slow.
> - * Each memory class is placed in its own NUMA node.  The
> - * CPUs are placed in the node with the "fast" memory.  The
> - * 6 NUMA nodes (0-5) might be split among the sockets like
> - * this:
> - *
> - *	Socket A: 0, 1, 2
> - *	Socket B: 3, 4, 5
> - *
> - * When Node 0 fills up, its memory should be migrated to
> - * Node 1.  When Node 1 fills up, it should be migrated to
> - * Node 2.  The migration path start on the nodes with the
> - * processors (since allocations default to this node) and
> - * fast memory, progress through medium and end with the
> - * slow memory:
> - *
> - *	0 -> 1 -> 2 -> stop
> - *	3 -> 4 -> 5 -> stop
> - *
> - * This is represented in the node_demotion[] like this:
> - *
> - *	{  nr=1, nodes[0]=1 }, // Node 0 migrates to 1
> - *	{  nr=1, nodes[0]=2 }, // Node 1 migrates to 2
> - *	{  nr=0, nodes[0]=-1 }, // Node 2 does not migrate
> - *	{  nr=1, nodes[0]=4 }, // Node 3 migrates to 4
> - *	{  nr=1, nodes[0]=5 }, // Node 4 migrates to 5
> - *	{  nr=0, nodes[0]=-1 }, // Node 5 does not migrate
> - *
> - * Moreover some systems may have multiple slow memory nodes.
> - * Suppose a system has one socket with 3 memory nodes, node 0
> - * is fast memory type, and node 1/2 both are slow memory
> - * type, and the distance between fast memory node and slow
> - * memory node is same. So the migration path should be:
> - *
> - *	0 -> 1/2 -> stop
> - *
> - * This is represented in the node_demotion[] like this:
> - *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
> - *	{ nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
> - *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
> - */
> -
> -/*
> - * Writes to this array occur without locking.  Cycles are
> - * not allowed: Node X demotes to Y which demotes to X...
> - *
> - * If multiple reads are performed, a single rcu_read_lock()
> - * must be held over all reads to ensure that no cycles are
> - * observed.
> - */
> -#define DEFAULT_DEMOTION_TARGET_NODES 15
> -
> -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
> -#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
> -#else
> -#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
> -#endif
> -
> -struct demotion_nodes {
> -	unsigned short nr;
> -	short nodes[DEMOTION_TARGET_NODES];
> -};
> -
> -static struct demotion_nodes *node_demotion __read_mostly;
> -
> -/**
> - * next_demotion_node() - Get the next node in the demotion path
> - * @node: The starting node to lookup the next node
> - *
> - * Return: node id for next memory node in the demotion path hierarchy
> - * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
> - * @node online or guarantee that it *continues* to be the next demotion
> - * target.
> - */
> -int next_demotion_node(int node)
> -{
> -	struct demotion_nodes *nd;
> -	unsigned short target_nr, index;
> -	int target;
> -
> -	if (!node_demotion)
> -		return NUMA_NO_NODE;
> -
> -	nd = &node_demotion[node];
> -
> -	/*
> -	 * node_demotion[] is updated without excluding this
> -	 * function from running.  RCU doesn't provide any
> -	 * compiler barriers, so the READ_ONCE() is required
> -	 * to avoid compiler reordering or read merging.
> -	 *
> -	 * Make sure to use RCU over entire code blocks if
> -	 * node_demotion[] reads need to be consistent.
> -	 */
> -	rcu_read_lock();
> -	target_nr = READ_ONCE(nd->nr);
> -
> -	switch (target_nr) {
> -	case 0:
> -		target = NUMA_NO_NODE;
> -		goto out;
> -	case 1:
> -		index = 0;
> -		break;
> -	default:
> -		/*
> -		 * If there are multiple target nodes, just select one
> -		 * target node randomly.
> -		 *
> -		 * In addition, we can also use round-robin to select
> -		 * target node, but we should introduce another variable
> -		 * for node_demotion[] to record last selected target node,
> -		 * that may cause cache ping-pong due to the changing of
> -		 * last target node. Or introducing per-cpu data to avoid
> -		 * caching issue, which seems more complicated. So selecting
> -		 * target node randomly seems better until now.
> -		 */
> -		index = get_random_int() % target_nr;
> -		break;
> -	}
> -
> -	target = READ_ONCE(nd->nodes[index]);
> -
> -out:
> -	rcu_read_unlock();
> -	return target;
> -}
> -
> -/* Disable reclaim-based migration. */
> -static void __disable_all_migrate_targets(void)
> -{
> -	int node, i;
> -
> -	if (!node_demotion)
> -		return;
> -
> -	for_each_online_node(node) {
> -		node_demotion[node].nr = 0;
> -		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
> -			node_demotion[node].nodes[i] = NUMA_NO_NODE;
> -	}
> -}
> -
> -static void disable_all_migrate_targets(void)
> -{
> -	__disable_all_migrate_targets();
> -
> -	/*
> -	 * Ensure that the "disable" is visible across the system.
> -	 * Readers will see either a combination of before+disable
> -	 * state or disable+after.  They will never see before and
> -	 * after state together.
> -	 *
> -	 * The before+after state together might have cycles and
> -	 * could cause readers to do things like loop until this
> -	 * function finishes.  This ensures they can only see a
> -	 * single "bad" read and would, for instance, only loop
> -	 * once.
> -	 */
> -	synchronize_rcu();
> -}
> -
> -/*
> - * Find an automatic demotion target for 'node'.
> - * Failing here is OK.  It might just indicate
> - * being at the end of a chain.
> - */
> -static int establish_migrate_target(int node, nodemask_t *used,
> -				    int best_distance)
> -{
> -	int migration_target, index, val;
> -	struct demotion_nodes *nd;
> -
> -	if (!node_demotion)
> -		return NUMA_NO_NODE;
> -
> -	nd = &node_demotion[node];
> -
> -	migration_target = find_next_best_node(node, used);
> -	if (migration_target == NUMA_NO_NODE)
> -		return NUMA_NO_NODE;
> -
> -	/*
> -	 * If the node has been set a migration target node before,
> -	 * which means it's the best distance between them. Still
> -	 * check if this node can be demoted to other target nodes
> -	 * if they have a same best distance.
> -	 */
> -	if (best_distance != -1) {
> -		val = node_distance(node, migration_target);
> -		if (val > best_distance)
> -			goto out_clear;
> -	}
> -
> -	index = nd->nr;
> -	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
> -		      "Exceeds maximum demotion target nodes\n"))
> -		goto out_clear;
> -
> -	nd->nodes[index] = migration_target;
> -	nd->nr++;
> -
> -	return migration_target;
> -out_clear:
> -	node_clear(migration_target, *used);
> -	return NUMA_NO_NODE;
> -}
> -
> -/*
> - * When memory fills up on a node, memory contents can be
> - * automatically migrated to another node instead of
> - * discarded at reclaim.
> - *
> - * Establish a "migration path" which will start at nodes
> - * with CPUs and will follow the priorities used to build the
> - * page allocator zonelists.
> - *
> - * The difference here is that cycles must be avoided.  If
> - * node0 migrates to node1, then neither node1, nor anything
> - * node1 migrates to can migrate to node0. Also one node can
> - * be migrated to multiple nodes if the target nodes all have
> - * a same best-distance against the source node.
> - *
> - * This function can run simultaneously with readers of
> - * node_demotion[].  However, it can not run simultaneously
> - * with itself.  Exclusion is provided by memory hotplug events
> - * being single-threaded.
> - */
> -static void __set_migration_target_nodes(void)
> -{
> -	nodemask_t next_pass;
> -	nodemask_t this_pass;
> -	nodemask_t used_targets = NODE_MASK_NONE;
> -	int node, best_distance;
> -
> -	/*
> -	 * Avoid any oddities like cycles that could occur
> -	 * from changes in the topology.  This will leave
> -	 * a momentary gap when migration is disabled.
> -	 */
> -	disable_all_migrate_targets();
> -
> -	/*
> -	 * Allocations go close to CPUs, first.  Assume that
> -	 * the migration path starts at the nodes with CPUs.
> -	 */
> -	next_pass = node_states[N_CPU];
> -again:
> -	this_pass = next_pass;
> -	next_pass = NODE_MASK_NONE;
> -	/*
> -	 * To avoid cycles in the migration "graph", ensure
> -	 * that migration sources are not future targets by
> -	 * setting them in 'used_targets'.  Do this only
> -	 * once per pass so that multiple source nodes can
> -	 * share a target node.
> -	 *
> -	 * 'used_targets' will become unavailable in future
> -	 * passes.  This limits some opportunities for
> -	 * multiple source nodes to share a destination.
> -	 */
> -	nodes_or(used_targets, used_targets, this_pass);
> -
> -	for_each_node_mask(node, this_pass) {
> -		best_distance = -1;
> -
> -		/*
> -		 * Try to set up the migration path for the node, and the target
> -		 * migration nodes can be multiple, so doing a loop to find all
> -		 * the target nodes if they all have a best node distance.
> -		 */
> -		do {
> -			int target_node =
> -				establish_migrate_target(node, &used_targets,
> -							 best_distance);
> -
> -			if (target_node == NUMA_NO_NODE)
> -				break;
> -
> -			if (best_distance == -1)
> -				best_distance = node_distance(node, target_node);
> -
> -			/*
> -			 * Visit targets from this pass in the next pass.
> -			 * Eventually, every node will have been part of
> -			 * a pass, and will become set in 'used_targets'.
> -			 */
> -			node_set(target_node, next_pass);
> -		} while (1);
> -	}
> -	/*
> -	 * 'next_pass' contains nodes which became migration
> -	 * targets in this pass.  Make additional passes until
> -	 * no more migrations targets are available.
> -	 */
> -	if (!nodes_empty(next_pass))
> -		goto again;
> -}
> -
> -/*
> - * For callers that do not hold get_online_mems() already.
> - */
> -void set_migration_target_nodes(void)
> -{
> -	get_online_mems();
> -	__set_migration_target_nodes();
> -	put_online_mems();
> -}
> -
> -/*
> - * This leaves migrate-on-reclaim transiently disabled between
> - * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
> - * whether reclaim-based migration is enabled or not, which
> - * ensures that the user can turn reclaim-based migration at
> - * any time without needing to recalculate migration targets.
> - *
> - * These callbacks already hold get_online_mems().  That is why
> - * __set_migration_target_nodes() can be used as opposed to
> - * set_migration_target_nodes().
> - */
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> -						 unsigned long action, void *_arg)
> -{
> -	struct memory_notify *arg = _arg;
> -
> -	/*
> -	 * Only update the node migration order when a node is
> -	 * changing status, like online->offline.  This avoids
> -	 * the overhead of synchronize_rcu() in most cases.
> -	 */
> -	if (arg->status_change_nid < 0)
> -		return notifier_from_errno(0);
> -
> -	switch (action) {
> -	case MEM_GOING_OFFLINE:
> -		/*
> -		 * Make sure there are not transient states where
> -		 * an offline node is a migration target.  This
> -		 * will leave migration disabled until the offline
> -		 * completes and the MEM_OFFLINE case below runs.
> -		 */
> -		disable_all_migrate_targets();
> -		break;
> -	case MEM_OFFLINE:
> -	case MEM_ONLINE:
> -		/*
> -		 * Recalculate the target nodes once the node
> -		 * reaches its final state (online or offline).
> -		 */
> -		__set_migration_target_nodes();
> -		break;
> -	case MEM_CANCEL_OFFLINE:
> -		/*
> -		 * MEM_GOING_OFFLINE disabled all the migration
> -		 * targets.  Reenable them.
> -		 */
> -		__set_migration_target_nodes();
> -		break;
> -	case MEM_GOING_ONLINE:
> -	case MEM_CANCEL_ONLINE:
> -		break;
> -	}
> -
> -	return notifier_from_errno(0);
> -}
> -#endif
> -
> -void __init migrate_on_reclaim_init(void)
> -{
> -	node_demotion = kcalloc(nr_node_ids,
> -				sizeof(struct demotion_nodes),
> -				GFP_KERNEL);
> -	WARN_ON(!node_demotion);
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
> -#endif
> -	/*
> -	 * At this point, all numa nodes with memory/CPus have their state
> -	 * properly set, so we can build the demotion order now.
> -	 * Let us hold the cpu_hotplug lock just, as we could possibily have
> -	 * CPU hotplug events during boot.
> -	 */
> -	cpus_read_lock();
> -	set_migration_target_nodes();
> -	cpus_read_unlock();
> -}
>  #endif /* CONFIG_NUMA */
> -
> -
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index da525bfb6f4a..835e3c028f35 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -28,7 +28,6 @@
>  #include <linux/mm_inline.h>
>  #include <linux/page_ext.h>
>  #include <linux/page_owner.h>
> -#include <linux/migrate.h>
>
>  #include "internal.h"
>
> @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu)
>
>  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
>  		node_set_state(cpu_to_node(cpu), N_CPU);
> -		set_migration_target_nodes();
>  	}
>
>  	return 0;
> @@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu)
>  		return 0;
>
>  	node_clear_state(node, N_CPU);
> -	set_migration_target_nodes();
>
>  	return 0;
>  }
> @@ -2118,7 +2115,6 @@ void __init init_mm_internals(void)
>
>  	start_shepherd_timer();
>  #endif
> -	migrate_on_reclaim_init();
>  #ifdef CONFIG_PROC_FS
>  	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
>  	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-06-07 22:51   ` Tim Chen
  2022-06-08  5:02     ` Aneesh Kumar K V
@ 2022-06-08  6:52     ` Ying Huang
  1 sibling, 0 replies; 84+ messages in thread
From: Ying Huang @ 2022-06-08  6:52 UTC (permalink / raw)
  To: Tim Chen, Aneesh Kumar K.V, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Tue, 2022-06-07 at 15:51 -0700, Tim Chen wrote:
> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> > 
> > +int next_demotion_node(int node)
> > +{
> > +	struct demotion_nodes *nd;
> > +	int target, nnodes, i;
> > +
> > +	if (!node_demotion)
> > +		return NUMA_NO_NODE;
> > +
> > +	nd = &node_demotion[node];
> > +
> > +	/*
> > +	 * node_demotion[] is updated without excluding this
> > +	 * function from running.
> > +	 *
> > +	 * Make sure to use RCU over entire code blocks if
> > +	 * node_demotion[] reads need to be consistent.
> > +	 */
> > +	rcu_read_lock();
> > +
> > +	nnodes = nodes_weight(nd->preferred);
> > +	if (!nnodes)
> > +		return NUMA_NO_NODE;
> > +
> > +	/*
> > +	 * If there are multiple target nodes, just select one
> > +	 * target node randomly.
> > +	 *
> > +	 * In addition, we can also use round-robin to select
> > +	 * target node, but we should introduce another variable
> > +	 * for node_demotion[] to record last selected target node,
> > +	 * that may cause cache ping-pong due to the changing of
> > +	 * last target node. Or introducing per-cpu data to avoid
> > +	 * caching issue, which seems more complicated. So selecting
> > +	 * target node randomly seems better until now.
> > +	 */
> > +	nnodes = get_random_int() % nnodes;
> > +	target = first_node(nd->preferred);
> > +	for (i = 0; i < nnodes; i++)
> > +		target = next_node(target, nd->preferred);
> 
> We can simplify the above 4 lines.
> 
> 	target = node_random(&nd->preferred);
> 
> There's still a loop overhead though :(

To avoid the loop overhead, we can use the original implementation of
next_demotion_node.  The performance is much better for the most common
case, where the number of preferred nodes is 1.
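
That is, keep a fast path for the single-node case, something like this
(an untested sketch, combined with Tim's node_random() suggestion):

	nnodes = nodes_weight(nd->preferred);
	if (!nnodes) {
		target = NUMA_NO_NODE;
	} else if (nnodes == 1) {
		/* The most common case: exactly one preferred target. */
		target = first_node(nd->preferred);
	} else {
		/* Pick one of the equally preferred targets at random. */
		target = node_random(&nd->preferred);
	}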

Best Regards,
Huang, Ying

> >

> > +
> > +	rcu_read_unlock();
> > +
> > +	return target;
> > +}
> > +
> > 
> > + */
> > +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> > +						 unsigned long action, void *_arg)
> > +{
> > +	struct memory_notify *arg = _arg;
> > +
> > +	/*
> > +	 * Only update the node migration order when a node is
> > +	 * changing status, like online->offline.
> > +	 */
> > +	if (arg->status_change_nid < 0)
> > +		return notifier_from_errno(0);
> > +
> > +	switch (action) {
> > +	case MEM_OFFLINE:
> > +		/*
> > +		 * In case we are moving out of N_MEMORY. Keep the node
> > +		 * in the memory tier so that when we bring memory online,
> > +		 * they appear in the right memory tier. We still need
> > +		 * to rebuild the demotion order.
> > +		 */
> > +		mutex_lock(&memory_tier_lock);
> > +		establish_migration_targets();
> > +		mutex_unlock(&memory_tier_lock);
> > +		break;
> > +	case MEM_ONLINE:
> > +		/*
> > +		 * We ignore the error here; if the node already has the tier
> > +		 * registered, we will continue to use that for the new memory
> > +		 * we are adding here.
> > +		 */
> > +		node_set_memory_tier(arg->status_change_nid, DEFAULT_MEMORY_TIER);
> 
> Should establish_migration_targets() be run here? Otherwise what are the
> demotion targets for this newly onlined node?
> 
> > +		break;
> > +	}
> > +
> > +	return notifier_from_errno(0);
> > +}
> > +
> 
> Tim
> 



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers
  2022-06-03 13:42 ` [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers Aneesh Kumar K.V
  2022-06-07 23:40   ` Tim Chen
@ 2022-06-08  6:59   ` Ying Huang
  2022-06-08  8:20     ` Aneesh Kumar K V
  1 sibling, 1 reply; 84+ messages in thread
From: Ying Huang @ 2022-06-08  6:59 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> This patch adds the special string "none" as a supported memtier value
> that can be used to remove a specific node from being used as a demotion target.
> 
> For ex:
> :/sys/devices/system/node/node1# cat memtier
> 1
> :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
> 1-3
> :/sys/devices/system/node/node1# echo none > memtier
> :/sys/devices/system/node/node1#
> :/sys/devices/system/node/node1# cat memtier
> :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
> 2-3
> :/sys/devices/system/node/node1#

Do you have a practical use case for this?  What kind of memory node
needs to be removed from memory tier demotion/promotion?

Best Regards,
Huang, Ying


[snip]



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-06-06  8:33         ` Aneesh Kumar K V
@ 2022-06-08  7:26           ` Ying Huang
  2022-06-08  8:28             ` Aneesh Kumar K V
  0 siblings, 1 reply; 84+ messages in thread
From: Ying Huang @ 2022-06-08  7:26 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Mon, 2022-06-06 at 14:03 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 12:54 PM, Ying Huang wrote:
> > On Mon, 2022-06-06 at 09:22 +0530, Aneesh Kumar K V wrote:
> > > On 6/6/22 8:41 AM, Ying Huang wrote:
> > > > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> > > > > With memory tiers support we can have memory on NUMA nodes
> > > > > in the top tier for which we want to avoid NUMA-fault-based
> > > > > promotion tracking. Update node_is_toptier to work with memory
> > > > > tiers. To avoid taking locks, a nodemask is maintained of all
> > > > > demotion targets. All NUMA nodes are top tier nodes by default,
> > > > > and as new lower memory tiers are added, their NUMA nodes get
> > > > > added to the demotion targets, thereby moving them out of the
> > > > > top tier.
> > > > 
> > > > Check the usage of node_is_toptier(),
> > > > 
> > > > - migrate_misplaced_page()
> > > >     node_is_toptier() is used to check whether migration is a promotion.
> > > > We can avoid using it.  Just compare the ranks of the nodes.
> > > > 
> > > > - change_pte_range() and change_huge_pmd()
> > > >     node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
> > > > for promotion.  So I think we should change the name to node_is_fast()
> > > > as follows,
> > > > 
> > > > static inline bool node_is_fast(int node)
> > > > {
> > > > 	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
> > > > }
> > > > 
> > > 
> > > But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other
> > > patches, the absolute value of a rank doesn't carry any meaning. It is
> > > only the relative value w.r.t. other memory tiers that decides whether a
> > > tier is fast or not. Agreed, by default memory tiers get built with
> > > MEMORY_RANK_DRAM, but userspace can change the rank value of 'memtier1'.
> > > Hence, determining that a node consists of fast memory essentially means
> > > figuring out whether the node is in the topmost tier of the memory
> > > hierarchy, not just whether its memory tier rank value is >= MEMORY_RANK_DRAM.
> > 
> > In a system with 3 tiers,
> > 
> > HBM	0
> > DRAM	1
> > PMEM	2
> > 
> > In your implementation, only HBM will be considered fast.  But what we
> > need is to consider both HBM and DRAM fast.  Because we use NUMA
> > balancing to promote PMEM pages to DRAM.  It's unnecessary to scan HBM
> > and DRAM pages for that.  And there're no requirements to promote DRAM
> > pages to HBM with NUMA balancing.
> > 
> > I can understand that the memory tiers are more dynamic now.  For the
> > requirements of NUMA balancing, we need the lowest memory tier (rank)
> > where there's at least one node with a CPU.  The nodes in it and in the
> > higher tiers will be considered fast.
> > 
> 
> Is this good (not tested)?
> /*
>   * Build the allowed promotion mask. Promotion of pages is
>   * allowed from a memory tier only if none of the nodes in
>   * that tier has CPUs. Once we detect a memory tier in which
>   * some node has CPUs, we consider that tier (and every tier
>   * above it) to be the top tier, from which promotion is not
>   * allowed.
>   */
> list_for_each_entry_reverse(memtier, &memory_tiers, list) {
> 	nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
> 	if (nodes_empty(allowed))
> 		nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
> 	else
> 		break;
> }
>
> and then
>
> static inline bool node_is_toptier(int node)
> {
> 	return !node_isset(node, promotion_mask);
> }
> 

This should work.  But it appears unnatural.  I don't think we should
keep adding more and more node masks to mitigate the design decision
that we cannot access memory tier information directly.  All of this
becomes simple and natural if we can access memory tier information
directly.
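
For example, with direct access to a node's tier, node_is_toptier()
could be as simple as the following (a sketch only; node_memtier_rank()
and top_tier_rank are hypothetical names, not existing interfaces):

static inline bool node_is_toptier(int node)
{
	/*
	 * top_tier_rank: the rank of the lowest tier that still
	 * contains a node with CPUs; nodes at or above that rank
	 * are considered fast.
	 */
	return node_memtier_rank(node) >= top_tier_rank;
}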

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-06-03 13:42 ` [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
  2022-06-07 22:51   ` Tim Chen
  2022-06-08  6:50   ` Ying Huang
@ 2022-06-08  8:00   ` Ying Huang
  2 siblings, 0 replies; 84+ messages in thread
From: Ying Huang @ 2022-06-08  8:00 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> This patch switches the demotion target building logic to use memory tiers
> instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
> default tier 1 and additional memory tiers will be added by drivers like
> dax kmem.
> 
> This patch builds the demotion target for a NUMA node by looking at all
> memory tiers below the tier to which the NUMA node belongs. The closest node
> in the immediately following memory tier is used as a demotion target.
> 
> Since we are now only building demotion targets for N_MEMORY NUMA nodes,
> the CPU hotplug calls are removed in this patch.
> 
> The rank approach allows us to keep memory tier device IDs stable even if there
> is a need to change the tier ordering among different memory tiers. e.g. DRAM
> nodes with CPUs will always be on memtier1, no matter how many tiers are higher
> or lower than these nodes. A new memory tier can be inserted into the tier
> hierarchy for a new set of nodes without affecting the node assignment of any
> existing memtier, provided that there is enough gap in the rank values for the
> new memtier.
> 
> The absolute value of "rank" of a memtier doesn't necessarily carry any meaning.
> Its value relative to other memtiers decides the level of this memtier in the tier
> hierarchy.
> 
> For now, this patch supports hardcoded rank values, which are 300, 200 and
> 100 for memory tiers 0, 1 and 2 respectively.
> 
> Below is the sysfs interface to read the rank values of memory tier,
> /sys/devices/system/memtier/memtierN/rank
> 
> This interface is read-only for now. Write support can be added when there
> is a need for more memory tiers (> 3) with a flexible ordering requirement
> among them.
> 
> Suggested-by: Wei Xu <weixugc@google.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h |   5 +
>  include/linux/migrate.h      |  13 --
>  mm/memory-tiers.c            | 269 ++++++++++++++++++++++++
>  mm/migrate.c                 | 394 -----------------------------------
>  mm/vmstat.c                  |   4 -
>  5 files changed, 274 insertions(+), 411 deletions(-)
> 
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 33ef36395a20..adc2cb3bf5f8 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -16,11 +16,16 @@
>  #define MAX_MEMORY_TIERS  3
>
>  extern bool numa_demotion_enabled;
> +int next_demotion_node(int node);
>  int node_get_memory_tier_id(int node);
>  int node_set_memory_tier(int node, int tier);
>  int node_reset_memory_tier(int node, int tier);
>  #else
>  #define numa_demotion_enabled	false
> +static inline int next_demotion_node(int node)
> +{
> +	return NUMA_NO_NODE;
>
>  #endif	/* CONFIG_TIERED_MEMORY */
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 43e737215f33..93fab62e6548 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
>
>  #endif /* CONFIG_MIGRATION */
>
> -#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
> -extern void set_migration_target_nodes(void);
> -extern void migrate_on_reclaim_init(void);
> -extern int next_demotion_node(int node);
> -#else
> -static inline void set_migration_target_nodes(void) {}
> -static inline void migrate_on_reclaim_init(void) {}
> -static inline int next_demotion_node(int node)
> -{
> -        return NUMA_NO_NODE;
> -}
> -#endif
> -
>  #ifdef CONFIG_COMPACTION
>  extern int PageMovable(struct page *page);
>  extern void __SetPageMovable(struct page *page, struct address_space *mapping);
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 3f382d1f844a..0d05c0bfb79b 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -4,6 +4,10 @@
>  #include <linux/nodemask.h>
>  #include <linux/slab.h>
>  #include <linux/memory-tiers.h>
> +#include <linux/random.h>
> +#include <linux/memory.h>
> +
> +#include "internal.h"
>
>  struct memory_tier {
>  	struct list_head list;
> @@ -12,6 +16,10 @@ struct memory_tier {
>  	int rank;
>  };
>
> +struct demotion_nodes {
> +	nodemask_t preferred;
> +};
> +
>  #define to_memory_tier(device) container_of(device, struct memory_tier, dev)
>
>  static struct bus_type memory_tier_subsys = {
> @@ -19,9 +27,71 @@ static struct bus_type memory_tier_subsys = {
>  	.dev_name = "memtier",
>  };
>  
>
> +static void establish_migration_targets(void);
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
>  
>
> +/*
> + * node_demotion[] examples:
> + *
> + * Example 1:
> + *
> + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
> + *
> + * node distances:
> + * node   0    1    2    3
> + *    0  10   20   30   40
> + *    1  20   10   40   30
> + *    2  30   40   10   40
> + *    3  40   30   40   10
> + *
> + * memory_tiers[0] = <empty>
> + * memory_tiers[1] = 0-1
> + * memory_tiers[2] = 2-3
> + *
> + * node_demotion[0].preferred = 2
> + * node_demotion[1].preferred = 3
> + * node_demotion[2].preferred = <empty>
> + * node_demotion[3].preferred = <empty>
> + *
> + * Example 2:
> + *
> + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
> + *
> + * node distances:
> + * node   0    1    2
> + *    0  10   20   30
> + *    1  20   10   30
> + *    2  30   30   10
> + *
> + * memory_tiers[0] = <empty>
> + * memory_tiers[1] = 0-2
> + * memory_tiers[2] = <empty>
> + *
> + * node_demotion[0].preferred = <empty>
> + * node_demotion[1].preferred = <empty>
> + * node_demotion[2].preferred = <empty>
> + *
> + * Example 3:
> + *
> + * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
> + *
> + * node distances:
> + * node   0    1    2
> + *    0  10   20   30
> + *    1  20   10   40
> + *    2  30   40   10
> + *
> + * memory_tiers[0] = 1
> + * memory_tiers[1] = 0
> + * memory_tiers[2] = 2
> + *
> + * node_demotion[0].preferred = 2
> + * node_demotion[1].preferred = 0
> + * node_demotion[2].preferred = <empty>
> + *
> + */
> +static struct demotion_nodes *node_demotion __read_mostly;
>
>  static ssize_t nodelist_show(struct device *dev,
>  			     struct device_attribute *attr, char *buf)
> @@ -202,6 +272,7 @@ static void node_remove_from_memory_tier(int node)
>  	if (nodes_empty(memtier->nodelist))
>  		unregister_memory_tier(memtier);
>
> +	establish_migration_targets();
>  out:
>  	mutex_unlock(&memory_tier_lock);
>  }
> @@ -263,6 +334,8 @@ int node_reset_memory_tier(int node, int tier)
>
>  	if (nodes_empty(current_tier->nodelist))
>  		unregister_memory_tier(current_tier);
> +
> +	establish_migration_targets();
>  out:
>  	mutex_unlock(&memory_tier_lock);
>
> @@ -276,13 +349,208 @@ int node_set_memory_tier(int node, int tier)
>
>  	mutex_lock(&memory_tier_lock);
>  	memtier = __node_get_memory_tier(node);
> +	/*
> +	 * if node is already part of the tier proceed with the
> +	 * current tier value, because we might want to establish
> +	 * new migration paths now. The node might be added to a tier
> +	 * before it was made part of N_MEMORY, hence establish_migration_targets
> +	 * will have skipped this node.
> +	 */
>  	if (!memtier)
>  		ret = __node_set_memory_tier(node, tier);
> +	establish_migration_targets();
> +
>  	mutex_unlock(&memory_tier_lock);
>
>  	return ret;
>  }
>
> +/**
> + * next_demotion_node() - Get the next node in the demotion path
> + * @node: The starting node to lookup the next node
> + *
> + * Return: node id for next memory node in the demotion path hierarchy
> + * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
> + * @node online or guarantee that it *continues* to be the next demotion
> + * target.
> + */
> +int next_demotion_node(int node)
> +{
> +	struct demotion_nodes *nd;
> +	int target, nnodes, i;
> +
> +	if (!node_demotion)
> +		return NUMA_NO_NODE;
> +
> +	nd = &node_demotion[node];
> +
> +	/*
> +	 * node_demotion[] is updated without excluding this
> +	 * function from running.
> +	 *
> +	 * Make sure to use RCU over entire code blocks if
> +	 * node_demotion[] reads need to be consistent.
> +	 */
> +	rcu_read_lock();
> +
> +	nnodes = nodes_weight(nd->preferred);
> +	if (!nnodes)
> +		return NUMA_NO_NODE;

You forgot to call rcu_read_unlock() before returning.
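
E.g. something like this (sketch):

	nnodes = nodes_weight(nd->preferred);
	if (!nnodes) {
		rcu_read_unlock();
		return NUMA_NO_NODE;
	}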

Best Regards,
Huang, Ying

> +
> +	/*
> +	 * If there are multiple target nodes, just select one
> +	 * target node randomly.
> +	 *
> +	 * In addition, we can also use round-robin to select
> +	 * target node, but we should introduce another variable
> +	 * for node_demotion[] to record last selected target node,
> +	 * that may cause cache ping-pong due to the changing of
> +	 * last target node. Or introducing per-cpu data to avoid
> +	 * caching issue, which seems more complicated. So selecting
> +	 * target node randomly seems better until now.
> +	 */
> +	nnodes = get_random_int() % nnodes;
> +	target = first_node(nd->preferred);
> +	for (i = 0; i < nnodes; i++)
> +		target = next_node(target, nd->preferred);
> +
> +	rcu_read_unlock();
> +
> +	return target;
> +}
> +
> +/* Disable reclaim-based migration. */
> +static void __disable_all_migrate_targets(void)
> +{
> +	int node;
> +
> +	for_each_node_mask(node, node_states[N_MEMORY])
> +		node_demotion[node].preferred = NODE_MASK_NONE;
> +}
> +
> +static void disable_all_migrate_targets(void)
> +{
> +	__disable_all_migrate_targets();
> +
> +	/*
> +	 * Ensure that the "disable" is visible across the system.
> +	 * Readers will see either a combination of before+disable
> +	 * state or disable+after.  They will never see before and
> +	 * after state together.
> +	 */
> +	synchronize_rcu();
> +}
> +
> +/*
> + * Find an automatic demotion target for all memory
> + * nodes. Failing here is OK.  It might just indicate
> + * being at the end of a chain.
> + */
> +static void establish_migration_targets(void)
> +{
> +	struct memory_tier *memtier;
> +	struct demotion_nodes *nd;
> +	int target = NUMA_NO_NODE, node;
> +	int distance, best_distance;
> +	nodemask_t used;
> +
> +	if (!node_demotion)
> +		return;
> +
> +	disable_all_migrate_targets();
> +
> +	for_each_node_mask(node, node_states[N_MEMORY]) {
> +		best_distance = -1;
> +		nd = &node_demotion[node];
> +
> +		memtier = __node_get_memory_tier(node);
> +		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
> +			continue;
> +		/*
> +		 * Get the next memtier to find the demotion node list.
> +		 */
> +		memtier = list_next_entry(memtier, list);
> +
> +		/*
> +		 * find_next_best_node() uses the 'used' nodemask as a skip list.
> +		 * Add all memory nodes except the selected memory tier's
> +		 * nodelist to the skip list so that we find the best node
> +		 * within the memtier nodelist.
> +		 */
> +		nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);
> +
> +		/*
> +		 * Find all the nodes in the memory tier node list that are at the
> +		 * same best distance and add them to the preferred mask. We randomly
> +		 * select between the nodes in the preferred mask when allocating
> +		 * pages during demotion.
> +		 */
> +		do {
> +			target = find_next_best_node(node, &used);
> +			if (target == NUMA_NO_NODE)
> +				break;
> +
> +			distance = node_distance(node, target);
> +			if (distance == best_distance || best_distance == -1) {
> +				best_distance = distance;
> +				node_set(target, nd->preferred);
> +			} else {
> +				break;
> +			}
> +		} while (1);
> +	}
> +}
> +
> +/*
> + * This runs whether reclaim-based migration is enabled or not,
> + * which ensures that the user can turn reclaim-based migration
> + * on or off at any time without needing to recalculate migration
> + * targets.
> + */
> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> +						 unsigned long action, void *_arg)
> +{
> +	struct memory_notify *arg = _arg;
> +
> +	/*
> +	 * Only update the node migration order when a node is
> +	 * changing status, like online->offline.
> +	 */
> +	if (arg->status_change_nid < 0)
> +		return notifier_from_errno(0);
> +
> +	switch (action) {
> +	case MEM_OFFLINE:
> +		/*
> +		 * We may be moving out of N_MEMORY. Keep the node in its
> +		 * memory tier so that when memory is brought back online,
> +		 * it appears in the right memory tier. We still need to
> +		 * rebuild the demotion order.
> +		 */
> +		mutex_lock(&memory_tier_lock);
> +		establish_migration_targets();
> +		mutex_unlock(&memory_tier_lock);
> +		break;
> +	case MEM_ONLINE:
> +		/*
> +		 * We ignore the error here; if the node already has a tier
> +		 * registered, we will continue to use that for the new memory
> +		 * we are adding.
> +		 */
> +		node_set_memory_tier(arg->status_change_nid, DEFAULT_MEMORY_TIER);
> +		break;
> +	}
> +
> +	return notifier_from_errno(0);
> +}
> +
> +static void __init migrate_on_reclaim_init(void)
> +{
> +	node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
> +				GFP_KERNEL);
> +	WARN_ON(!node_demotion);
> +
> +	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
> +}
> +
>  static int __init memory_tier_init(void)
>  {
>  	int ret;
> @@ -302,6 +570,7 @@ static int __init memory_tier_init(void)
>  
>  	/* CPU only nodes are not part of memory tiers. */
>  	memtier->nodelist = node_states[N_MEMORY];
> +	migrate_on_reclaim_init();
>  
>  	return 0;
>  }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 29cacc217e38..0b554625a219 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2116,398 +2116,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>  	return 0;
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
> -
> -/*
> - * node_demotion[] example:
> - *
> - * Consider a system with two sockets.  Each socket has
> - * three classes of memory attached: fast, medium and slow.
> - * Each memory class is placed in its own NUMA node.  The
> - * CPUs are placed in the node with the "fast" memory.  The
> - * 6 NUMA nodes (0-5) might be split among the sockets like
> - * this:
> - *
> - *	Socket A: 0, 1, 2
> - *	Socket B: 3, 4, 5
> - *
> - * When Node 0 fills up, its memory should be migrated to
> - * Node 1.  When Node 1 fills up, it should be migrated to
> - * Node 2.  The migration path starts on the nodes with the
> - * processors (since allocations default to this node) and
> - * fast memory, progresses through medium and ends with the
> - * slow memory:
> - *
> - *	0 -> 1 -> 2 -> stop
> - *	3 -> 4 -> 5 -> stop
> - *
> - * This is represented in the node_demotion[] like this:
> - *
> - *	{  nr=1, nodes[0]=1 }, // Node 0 migrates to 1
> - *	{  nr=1, nodes[0]=2 }, // Node 1 migrates to 2
> - *	{  nr=0, nodes[0]=-1 }, // Node 2 does not migrate
> - *	{  nr=1, nodes[0]=4 }, // Node 3 migrates to 4
> - *	{  nr=1, nodes[0]=5 }, // Node 4 migrates to 5
> - *	{  nr=0, nodes[0]=-1 }, // Node 5 does not migrate
> - *
> - * Moreover some systems may have multiple slow memory nodes.
> - * Suppose a system has one socket with 3 memory nodes, node 0
> - * is fast memory type, and node 1/2 both are slow memory
> - * type, and the distance between the fast memory node and each slow
> - * memory node is the same. So the migration path should be:
> - *
> - *	0 -> 1/2 -> stop
> - *
> - * This is represented in the node_demotion[] like this:
> - *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
> - *	{ nr=0, nodes[0]=-1, }, // Node 1 does not migrate
> - *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
> - */
> -
> -/*
> - * Writes to this array occur without locking.  Cycles are
> - * not allowed: Node X demotes to Y which demotes to X...
> - *
> - * If multiple reads are performed, a single rcu_read_lock()
> - * must be held over all reads to ensure that no cycles are
> - * observed.
> - */
> -#define DEFAULT_DEMOTION_TARGET_NODES 15
> -
> -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
> -#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
> -#else
> -#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
> -#endif
> -
> -struct demotion_nodes {
> -	unsigned short nr;
> -	short nodes[DEMOTION_TARGET_NODES];
> -};
> -
> -static struct demotion_nodes *node_demotion __read_mostly;
> -
> -/**
> - * next_demotion_node() - Get the next node in the demotion path
> - * @node: The starting node to lookup the next node
> - *
> - * Return: node id for next memory node in the demotion path hierarchy
> - * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
> - * @node online or guarantee that it *continues* to be the next demotion
> - * target.
> - */
> -int next_demotion_node(int node)
> -{
> -	struct demotion_nodes *nd;
> -	unsigned short target_nr, index;
> -	int target;
> -
> -	if (!node_demotion)
> -		return NUMA_NO_NODE;
> -
> -	nd = &node_demotion[node];
> -
> -	/*
> -	 * node_demotion[] is updated without excluding this
> -	 * function from running.  RCU doesn't provide any
> -	 * compiler barriers, so the READ_ONCE() is required
> -	 * to avoid compiler reordering or read merging.
> -	 *
> -	 * Make sure to use RCU over entire code blocks if
> -	 * node_demotion[] reads need to be consistent.
> -	 */
> -	rcu_read_lock();
> -	target_nr = READ_ONCE(nd->nr);
> -
> -	switch (target_nr) {
> -	case 0:
> -		target = NUMA_NO_NODE;
> -		goto out;
> -	case 1:
> -		index = 0;
> -		break;
> -	default:
> -		/*
> -		 * If there are multiple target nodes, just select one
> -		 * target node randomly.
> -		 *
> -		 * In addition, we can also use round-robin to select
> -		 * target node, but we should introduce another variable
> -		 * for node_demotion[] to record last selected target node,
> -		 * that may cause cache ping-pong due to the changing of
> -		 * last target node. Or introducing per-cpu data to avoid
> -		 * caching issue, which seems more complicated. So selecting
> -		 * target node randomly seems better until now.
> -		 */
> -		index = get_random_int() % target_nr;
> -		break;
> -	}
> -
> -	target = READ_ONCE(nd->nodes[index]);
> -
> -out:
> -	rcu_read_unlock();
> -	return target;
> -}
> -
> -/* Disable reclaim-based migration. */
> -static void __disable_all_migrate_targets(void)
> -{
> -	int node, i;
> -
> -	if (!node_demotion)
> -		return;
> -
> -	for_each_online_node(node) {
> -		node_demotion[node].nr = 0;
> -		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
> -			node_demotion[node].nodes[i] = NUMA_NO_NODE;
> -	}
> -}
> -
> -static void disable_all_migrate_targets(void)
> -{
> -	__disable_all_migrate_targets();
> -
> -	/*
> -	 * Ensure that the "disable" is visible across the system.
> -	 * Readers will see either a combination of before+disable
> -	 * state or disable+after.  They will never see before and
> -	 * after state together.
> -	 *
> -	 * The before+after state together might have cycles and
> -	 * could cause readers to do things like loop until this
> -	 * function finishes.  This ensures they can only see a
> -	 * single "bad" read and would, for instance, only loop
> -	 * once.
> -	 */
> -	synchronize_rcu();
> -}
> -
> -/*
> - * Find an automatic demotion target for 'node'.
> - * Failing here is OK.  It might just indicate
> - * being at the end of a chain.
> - */
> -static int establish_migrate_target(int node, nodemask_t *used,
> -				    int best_distance)
> -{
> -	int migration_target, index, val;
> -	struct demotion_nodes *nd;
> -
> -	if (!node_demotion)
> -		return NUMA_NO_NODE;
> -
> -	nd = &node_demotion[node];
> -
> -	migration_target = find_next_best_node(node, used);
> -	if (migration_target == NUMA_NO_NODE)
> -		return NUMA_NO_NODE;
> -
> -	/*
> -	 * If the node has been set a migration target node before,
> -	 * which means it's the best distance between them. Still
> -	 * check if this node can be demoted to other target nodes
> -	 * if they have a same best distance.
> -	 */
> -	if (best_distance != -1) {
> -		val = node_distance(node, migration_target);
> -		if (val > best_distance)
> -			goto out_clear;
> -	}
> -
> -	index = nd->nr;
> -	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
> -		      "Exceeds maximum demotion target nodes\n"))
> -		goto out_clear;
> -
> -	nd->nodes[index] = migration_target;
> -	nd->nr++;
> -
> -	return migration_target;
> -out_clear:
> -	node_clear(migration_target, *used);
> -	return NUMA_NO_NODE;
> -}
> -
> -/*
> - * When memory fills up on a node, memory contents can be
> - * automatically migrated to another node instead of
> - * discarded at reclaim.
> - *
> - * Establish a "migration path" which will start at nodes
> - * with CPUs and will follow the priorities used to build the
> - * page allocator zonelists.
> - *
> - * The difference here is that cycles must be avoided.  If
> - * node0 migrates to node1, then neither node1, nor anything
> - * node1 migrates to can migrate to node0. Also one node can
> - * be migrated to multiple nodes if the target nodes all have
> - * a same best-distance against the source node.
> - *
> - * This function can run simultaneously with readers of
> - * node_demotion[].  However, it can not run simultaneously
> - * with itself.  Exclusion is provided by memory hotplug events
> - * being single-threaded.
> - */
> -static void __set_migration_target_nodes(void)
> -{
> -	nodemask_t next_pass;
> -	nodemask_t this_pass;
> -	nodemask_t used_targets = NODE_MASK_NONE;
> -	int node, best_distance;
> -
> -	/*
> -	 * Avoid any oddities like cycles that could occur
> -	 * from changes in the topology.  This will leave
> -	 * a momentary gap when migration is disabled.
> -	 */
> -	disable_all_migrate_targets();
> -
> -	/*
> -	 * Allocations go close to CPUs, first.  Assume that
> -	 * the migration path starts at the nodes with CPUs.
> -	 */
> -	next_pass = node_states[N_CPU];
> -again:
> -	this_pass = next_pass;
> -	next_pass = NODE_MASK_NONE;
> -	/*
> -	 * To avoid cycles in the migration "graph", ensure
> -	 * that migration sources are not future targets by
> -	 * setting them in 'used_targets'.  Do this only
> -	 * once per pass so that multiple source nodes can
> -	 * share a target node.
> -	 *
> -	 * 'used_targets' will become unavailable in future
> -	 * passes.  This limits some opportunities for
> -	 * multiple source nodes to share a destination.
> -	 */
> -	nodes_or(used_targets, used_targets, this_pass);
> -
> -	for_each_node_mask(node, this_pass) {
> -		best_distance = -1;
> -
> -		/*
> -		 * Try to set up the migration path for the node, and the target
> -		 * migration nodes can be multiple, so doing a loop to find all
> -		 * the target nodes if they all have a best node distance.
> -		 */
> -		do {
> -			int target_node =
> -				establish_migrate_target(node, &used_targets,
> -							 best_distance);
> -
> -			if (target_node == NUMA_NO_NODE)
> -				break;
> -
> -			if (best_distance == -1)
> -				best_distance = node_distance(node, target_node);
> -
> -			/*
> -			 * Visit targets from this pass in the next pass.
> -			 * Eventually, every node will have been part of
> -			 * a pass, and will become set in 'used_targets'.
> -			 */
> -			node_set(target_node, next_pass);
> -		} while (1);
> -	}
> -	/*
> -	 * 'next_pass' contains nodes which became migration
> -	 * targets in this pass.  Make additional passes until
> -	 * no more migrations targets are available.
> -	 */
> -	if (!nodes_empty(next_pass))
> -		goto again;
> -}
> -
> -/*
> - * For callers that do not hold get_online_mems() already.
> - */
> -void set_migration_target_nodes(void)
> -{
> -	get_online_mems();
> -	__set_migration_target_nodes();
> -	put_online_mems();
> -}
> -
> -/*
> - * This leaves migrate-on-reclaim transiently disabled between
> - * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
> - * whether reclaim-based migration is enabled or not, which
> - * ensures that the user can turn reclaim-based migration on or
> - * off at any time without needing to recalculate migration targets.
> - *
> - * These callbacks already hold get_online_mems().  That is why
> - * __set_migration_target_nodes() can be used as opposed to
> - * set_migration_target_nodes().
> - */
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> -						 unsigned long action, void *_arg)
> -{
> -	struct memory_notify *arg = _arg;
> -
> -	/*
> -	 * Only update the node migration order when a node is
> -	 * changing status, like online->offline.  This avoids
> -	 * the overhead of synchronize_rcu() in most cases.
> -	 */
> -	if (arg->status_change_nid < 0)
> -		return notifier_from_errno(0);
> -
> -	switch (action) {
> -	case MEM_GOING_OFFLINE:
> -		/*
> -		 * Make sure there are not transient states where
> -		 * an offline node is a migration target.  This
> -		 * will leave migration disabled until the offline
> -		 * completes and the MEM_OFFLINE case below runs.
> -		 */
> -		disable_all_migrate_targets();
> -		break;
> -	case MEM_OFFLINE:
> -	case MEM_ONLINE:
> -		/*
> -		 * Recalculate the target nodes once the node
> -		 * reaches its final state (online or offline).
> -		 */
> -		__set_migration_target_nodes();
> -		break;
> -	case MEM_CANCEL_OFFLINE:
> -		/*
> -		 * MEM_GOING_OFFLINE disabled all the migration
> -		 * targets.  Reenable them.
> -		 */
> -		__set_migration_target_nodes();
> -		break;
> -	case MEM_GOING_ONLINE:
> -	case MEM_CANCEL_ONLINE:
> -		break;
> -	}
> -
> -	return notifier_from_errno(0);
> -}
> -#endif
> -
> -void __init migrate_on_reclaim_init(void)
> -{
> -	node_demotion = kcalloc(nr_node_ids,
> -				sizeof(struct demotion_nodes),
> -				GFP_KERNEL);
> -	WARN_ON(!node_demotion);
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
> -#endif
> -	/*
> -	 * At this point, all NUMA nodes with memory/CPUs have their state
> -	 * properly set, so we can build the demotion order now.
> -	 * Let us hold the cpu_hotplug lock here, as we could possibly have
> -	 * CPU hotplug events during boot.
> -	 */
> -	cpus_read_lock();
> -	set_migration_target_nodes();
> -	cpus_read_unlock();
> -}
>  #endif /* CONFIG_NUMA */
> -
> -
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index da525bfb6f4a..835e3c028f35 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -28,7 +28,6 @@
>  #include <linux/mm_inline.h>
>  #include <linux/page_ext.h>
>  #include <linux/page_owner.h>
> -#include <linux/migrate.h>
>  
>  #include "internal.h"
>  
> @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu)
>  
>  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
>  		node_set_state(cpu_to_node(cpu), N_CPU);
> -		set_migration_target_nodes();
>  	}
>  
>  	return 0;
> @@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu)
>  		return 0;
>  
>  	node_clear_state(node, N_CPU);
> -	set_migration_target_nodes();
>  
>  	return 0;
>  }
> @@ -2118,7 +2115,6 @@ void __init init_mm_internals(void)
>  
>  	start_shepherd_timer();
>  #endif
> -	migrate_on_reclaim_init();
>  #ifdef CONFIG_PROC_FS
>  	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
>  	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-08  6:10       ` Ying Huang
@ 2022-06-08  8:04         ` Aneesh Kumar K V
  0 siblings, 0 replies; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08  8:04 UTC (permalink / raw)
  To: Ying Huang, Tim Chen, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/8/22 11:40 AM, Ying Huang wrote:
> On Wed, 2022-06-08 at 10:07 +0530, Aneesh Kumar K V wrote:
>> On 6/8/22 12:13 AM, Tim Chen wrote:
>> ...
>>
>>>>
>>>> +
>>>> +static void memory_tier_device_release(struct device *dev)
>>>> +{
>>>> +	struct memory_tier *tier = to_memory_tier(dev);
>>>> +
>>>
>>> Do we need some ref counts on memory_tier?
>>> If there is another device still using the same memtier,
>>> free below could cause a problem.
>>>
>>>> +	kfree(tier);
>>>> +}
>>>> +
>>>>
>>> ...
>>
>> The lifecycle of the memory_tier struct is tied to the sysfs device
>> lifetime, i.e., memory_tier_device_release() gets called only after the
>> last reference on that sysfs dev object is released. Hence we can be
>> sure there is no userspace keeping one of the memtier-related sysfs
>> files open.
>>
>> W.r.t. other memory devices sharing the same memtier: we unregister the
>> sysfs device only when the memory tier nodelist is empty, that is, when
>> no memory device is present in this memory tier.
> 
> memory_tier isn't only used by user space.  It is used inside the
> kernel too.  If some kernel code gets a pointer to struct memory_tier,
> we need to guarantee the pointer will not be freed under us.

As mentioned above, the current patchset avoids doing that.
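
If some kernel code ever did need to hold a memtier pointer across the
sysfs unregister, the usual device refcount pattern would apply -- an
illustrative sketch only, not something this series does today:

	struct memory_tier *memtier = __node_get_memory_tier(node);

	get_device(&memtier->dev);	/* pin: release cannot run under us */
	/* ... use memtier ... */
	put_device(&memtier->dev);	/* last ref frees via memory_tier_device_release() */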

> And as Tim pointed
> out, we need to use it in a hot path (for statistics), so some kind of
> RCU lock may be good.
> 

Sure, when that statistics code gets added, we can add the relevant
kref and locking details.

-aneesh



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-06-08  6:50   ` Ying Huang
@ 2022-06-08  8:19     ` Aneesh Kumar K V
  0 siblings, 0 replies; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08  8:19 UTC (permalink / raw)
  To: Ying Huang, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/8/22 12:20 PM, Ying Huang wrote:
> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>> This patch switches the demotion target building logic to use memory tiers
>> instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
>> default tier 1 and additional memory tiers will be added by drivers like
>> dax kmem.
>>
>> This patch builds the demotion target for a NUMA node by looking at all
>> memory tiers below the tier to which the NUMA node belongs. The closest node
>> in the immediately following memory tier is used as a demotion target.
>>
>> Since we are now only building demotion targets for N_MEMORY NUMA nodes,
>> the CPU hotplug calls are removed in this patch.
>>
>> The rank approach allows us to keep memory tier device IDs stable even if there
>> is a need to change the tier ordering among different memory tiers. e.g. DRAM
>> nodes with CPUs will always be on memtier1, no matter how many tiers are higher
>> or lower than these nodes. A new memory tier can be inserted into the tier
>> hierarchy for a new set of nodes without affecting the node assignment of any
>> existing memtier, provided that there is enough gap in the rank values for the
>> new memtier.
>>
>> The absolute value of "rank" of a memtier doesn't necessarily carry any meaning.
>> Its value relative to other memtiers decides the level of this memtier in the tier
>> hierarchy.
>>
>> For now, This patch supports hardcoded rank values which are 300, 200, & 100 for
>> memory tiers 0,1 & 2 respectively.
>>
>> Below is the sysfs interface to read the rank values of memory tier,
>> /sys/devices/system/memtier/memtierN/rank
>>
>> This interface is read-only for now. Write support can be added when there is
>> a need for more memory tiers (> 3) with a flexible ordering requirement
>> among them.
>>
>> Suggested-by: Wei Xu <weixugc@google.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   include/linux/memory-tiers.h |   5 +
>>   include/linux/migrate.h      |  13 --
>>   mm/memory-tiers.c            | 269 ++++++++++++++++++++++++
>>   mm/migrate.c                 | 394 -----------------------------------
>>   mm/vmstat.c                  |   4 -
>>   5 files changed, 274 insertions(+), 411 deletions(-)
> 
> It appears that you moved some code from migrate.c to memory-tiers.c and
> changed it.  If so, please separate the change.  That is, one patch
> only moves the code, the other changes the code.  This will make it easier
> to find out what is changed.

That was how it was done in an earlier version: we changed
establish_migration within the same file. The changes we are doing here
were so different that it was pointed out that they get very hard to
review as a context diff. Hence this patch, where we removed the old
code and wrote the new code in memory-tiers.c. I could still move the
code to memory-tiers.c first and do the changes on top of that; in fact
I do have a patch that does a similar code movement in the series, but
the resulting diff was not useful for an easy review.

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers
  2022-06-08  6:59   ` Ying Huang
@ 2022-06-08  8:20     ` Aneesh Kumar K V
  2022-06-08  8:23       ` Ying Huang
  0 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08  8:20 UTC (permalink / raw)
  To: Ying Huang, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/8/22 12:29 PM, Ying Huang wrote:
> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>> This patch adds the special string "none" as a supported memtier value
>> that we can use to remove a specific node from being used as a demotion target.
>>
>> For ex:
>> :/sys/devices/system/node/node1# cat memtier
>> 1
>> :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
>> 1-3
>> :/sys/devices/system/node/node1# echo none > memtier
>> :/sys/devices/system/node/node1#
>> :/sys/devices/system/node/node1# cat memtier
>> :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
>> 2-3
>> :/sys/devices/system/node/node1#
> 
> Do you have a practical use case for this?  What kind of memory node
> needs to be removed from memory tiers demotion/promotion?
> 

This came up in our internal discussion. It was mentioned that there is 
a need to keep some slow memory nodes from participating in demotion.

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers
  2022-06-08  8:20     ` Aneesh Kumar K V
@ 2022-06-08  8:23       ` Ying Huang
  2022-06-08  8:29         ` Aneesh Kumar K V
  0 siblings, 1 reply; 84+ messages in thread
From: Ying Huang @ 2022-06-08  8:23 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Wed, 2022-06-08 at 13:50 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 12:29 PM, Ying Huang wrote:
> > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> > > This patch adds the special string "none" as a supported memtier value
> > > that we can use to remove a specific node from being used as a demotion target.
> > > 
> > > For ex:
> > > :/sys/devices/system/node/node1# cat memtier
> > > 1
> > > :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
> > > 1-3
> > > :/sys/devices/system/node/node1# echo none > memtier
> > > :/sys/devices/system/node/node1#
> > > :/sys/devices/system/node/node1# cat memtier
> > > :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
> > > 2-3
> > > :/sys/devices/system/node/node1#
> > 
> > Do you have a practical use case for this?  What kind of memory node
> > needs to be removed from memory tiers demotion/promotion?
> > 
> 
> This came up in our internal discussion. It was mentioned that there is 
> a need to keep some slow memory nodes from participating in demotion.

Again, can you provide a practical use case?  Why shouldn't we demote
cold pages to these slow memory nodes?  How do we use these slow memory
nodes?  Are these slow memory nodes slower than disk?

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-06-08  7:26           ` Ying Huang
@ 2022-06-08  8:28             ` Aneesh Kumar K V
  2022-06-08  8:32               ` Ying Huang
  0 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08  8:28 UTC (permalink / raw)
  To: Ying Huang, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/8/22 12:56 PM, Ying Huang wrote:
> On Mon, 2022-06-06 at 14:03 +0530, Aneesh Kumar K V wrote:
>> On 6/6/22 12:54 PM, Ying Huang wrote:
>>> On Mon, 2022-06-06 at 09:22 +0530, Aneesh Kumar K V wrote:
>>>> On 6/6/22 8:41 AM, Ying Huang wrote:
>>>>> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>>>>>> With memory tiers support we can have memory on NUMA nodes
>>>>>> in the top tier from which we want to avoid promotion tracking NUMA
>>>>>> faults. Update node_is_toptier to work with memory tiers. To
>>>>>> avoid taking locks, a nodemask is maintained for all demotion
>>>>>> targets. All NUMA nodes are by default top tier nodes and as
>>>>>> we add new lower memory tiers NUMA nodes get added to the
>>>>>> demotion targets thereby moving them out of the top tier.
>>>>>
>>>>> Check the usage of node_is_toptier(),
>>>>>
>>>>> - migrate_misplaced_page()
>>>>>      node_is_toptier() is used to check whether migration is a promotion.
>>>>> We can avoid to use it.  Just compare the rank of the nodes.
>>>>>
>>>>> - change_pte_range() and change_huge_pmd()
>>>>>      node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
>>>>> for promotion.  So I think we should change the name to node_is_fast()
>>>>> as follows,
>>>>>
>>>>> static inline bool node_is_fast(int node)
>>>>> {
>>>>> 	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
>>>>> }
>>>>>
>>>>
>>>> But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other
>>>> patches, absolute value of rank doesn't carry any meaning. It is only
>>>> the relative value w.r.t other memory tiers that decide whether it is
>>>> fast or not. Agreed by default memory tiers get built with
>>>> MEMORY_RANK_DRAM. But userspace can change the rank value of 'memtier1'.
>>>> Hence, determining that a node consists of fast memory is essentially
>>>> figuring out whether the node is in the topmost tier of the memory
>>>> hierarchy, and not just whether its memory tier rank value is >= MEMORY_RANK_DRAM?
>>>
>>> In a system with 3 tiers,
>>>
>>> HBM	0
>>> DRAM	1
>>> PMEM	2
>>>
>>> In your implementation, only HBM will be considered fast.  But what we
>>> need is to consider both HBM and DRAM fast.  Because we use NUMA
>>> balancing to promote PMEM pages to DRAM.  It's unnecessary to scan HBM
>>> and DRAM pages for that.  And there're no requirements to promote DRAM
>>> pages to HBM with NUMA balancing.
>>>
>>> I can understand that the memory tiers are more dynamic now.  For
>>> requirements of NUMA balancing, we need the lowest memory tier (rank)
>>> where there's at least one node with CPU.  The nodes in it and the
>>> higher tiers will be considered fast.
>>>
>>
>> is this good (not tested)?
>> /*
>>    * Build the allowed promotion mask. Promotion is allowed from a
>>    * memory tier only if that tier sits below the lowest tier that
>>    * includes compute. We want to skip promotion from a memory tier
>>    * if any node which is part of that memory tier has CPUs. Once we
>>    * detect such a memory tier, we consider that tier as top tier from
>>    * which promotion is not allowed.
>>    */
>> list_for_each_entry_reverse(memtier, &memory_tiers, list) {
>> 	nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
>> 	if (nodes_empty(allowed))
>> 		nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
>> 	else
>> 		break;
>> }
>>
>> and then
>>
>> static inline bool node_is_toptier(int node)
>> {
>>
>> 	return !node_isset(node, promotion_mask);
>> }
>>
> 
> This should work.  But it appears unnatural.  So, I don't think we
> should keep adding more and more node masks to mitigate the design
> decision that we cannot access memory tier information directly.  All
> of this becomes simple and natural if we can access memory tier
> information directly.
> 

how do you derive whether a node is top tier if we have memtier
details in pgdat?

-aneesh




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers
  2022-06-08  8:23       ` Ying Huang
@ 2022-06-08  8:29         ` Aneesh Kumar K V
  2022-06-08  8:34           ` Ying Huang
  0 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08  8:29 UTC (permalink / raw)
  To: Ying Huang, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/8/22 1:53 PM, Ying Huang wrote:
> On Wed, 2022-06-08 at 13:50 +0530, Aneesh Kumar K V wrote:
>> On 6/8/22 12:29 PM, Ying Huang wrote:
>>> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>>>> This patch adds the special string "none" as a supported memtier value
>>>> that we can use to remove a specific node from being used as a demotion target.
>>>>
>>>> For ex:
>>>> :/sys/devices/system/node/node1# cat memtier
>>>> 1
>>>> :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
>>>> 1-3
>>>> :/sys/devices/system/node/node1# echo none > memtier
>>>> :/sys/devices/system/node/node1#
>>>> :/sys/devices/system/node/node1# cat memtier
>>>> :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
>>>> 2-3
>>>> :/sys/devices/system/node/node1#
>>>
>>> Do you have a practical use case for this?  What kind of memory node
>>> needs to be removed from memory tiers demotion/promotion?
>>>
>>
>> This came up in our internal discussion. It was mentioned that there is
> > a need to keep some slow memory nodes from participating in demotion.
> 
> Again, can you provide a practical use case?  Why shouldn't we demote
> cold pages to these slow memory nodes?  How do we use these slow memory
> nodes?  Are these slow memory nodes slower than disk?
> 

This was discussed in the context of memory borrowed from a remote
machine (aka OpenCAPI memory). In such a case, we would have a
memory-only NUMA node which we want to avoid using for demotion.

-aneesh


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-06-08  8:28             ` Aneesh Kumar K V
@ 2022-06-08  8:32               ` Ying Huang
  2022-06-08 14:37                 ` Aneesh Kumar K.V
  0 siblings, 1 reply; 84+ messages in thread
From: Ying Huang @ 2022-06-08  8:32 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Wed, 2022-06-08 at 13:58 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 12:56 PM, Ying Huang wrote:
> > On Mon, 2022-06-06 at 14:03 +0530, Aneesh Kumar K V wrote:
> > > On 6/6/22 12:54 PM, Ying Huang wrote:
> > > > On Mon, 2022-06-06 at 09:22 +0530, Aneesh Kumar K V wrote:
> > > > > On 6/6/22 8:41 AM, Ying Huang wrote:
> > > > > > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> > > > > > > With memory tiers support we can have memory on NUMA nodes
> > > > > > > in the top tier from which we want to avoid promotion tracking NUMA
> > > > > > > faults. Update node_is_toptier to work with memory tiers. To
> > > > > > > avoid taking locks, a nodemask is maintained for all demotion
> > > > > > > targets. All NUMA nodes are by default top tier nodes and as
> > > > > > > we add new lower memory tiers NUMA nodes get added to the
> > > > > > > demotion targets thereby moving them out of the top tier.
> > > > > > 
> > > > > > Check the usage of node_is_toptier(),
> > > > > > 
> > > > > > - migrate_misplaced_page()
> > > > > >      node_is_toptier() is used to check whether migration is a promotion.
> > > > > > We can avoid to use it.  Just compare the rank of the nodes.
> > > > > > 
> > > > > > - change_pte_range() and change_huge_pmd()
> > > > > >      node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
> > > > > > for promotion.  So I think we should change the name to node_is_fast()
> > > > > > as follows,
> > > > > > 
> > > > > > static inline bool node_is_fast(int node)
> > > > > > {
> > > > > > 	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
> > > > > > }
> > > > > > 
> > > > > 
> > > > > But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other
> > > > > patches, absolute value of rank doesn't carry any meaning. It is only
> > > > > the relative value w.r.t other memory tiers that decide whether it is
> > > > > fast or not. Agreed by default memory tiers get built with
> > > > > MEMORY_RANK_DRAM. But userspace can change the rank value of 'memtier1'.
> > > > > Hence, determining that a node consists of fast memory is essentially
> > > > > figuring out whether the node is in the topmost tier of the memory
> > > > > hierarchy, and not just whether its memory tier rank value is >= MEMORY_RANK_DRAM?
> > > > 
> > > > In a system with 3 tiers,
> > > > 
> > > > HBM	0
> > > > DRAM	1
> > > > PMEM	2
> > > > 
> > > > In your implementation, only HBM will be considered fast.  But what we
> > > > need is to consider both HBM and DRAM fast.  Because we use NUMA
> > > > balancing to promote PMEM pages to DRAM.  It's unnecessary to scan HBM
> > > > and DRAM pages for that.  And there're no requirements to promote DRAM
> > > > pages to HBM with NUMA balancing.
> > > > 
> > > > I can understand that the memory tiers are more dynamic now.  For
> > > > requirements of NUMA balancing, we need the lowest memory tier (rank)
> > > > where there's at least one node with CPU.  The nodes in it and the
> > > > higher tiers will be considered fast.
> > > > 
> > > 
> > > is this good (not tested)?
> > > /*
> > >    * Build the allowed promotion mask. Promotion is allowed from a
> > >    * memory tier only if that tier sits below the lowest tier that
> > >    * includes compute. We want to skip promotion from a memory tier
> > >    * if any node which is part of that memory tier has CPUs. Once we
> > >    * detect such a memory tier, we consider that tier as top tier from
> > >    * which promotion is not allowed.
> > >    */
> > > list_for_each_entry_reverse(memtier, &memory_tiers, list) {
> > > 	nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
> > > 	if (nodes_empty(allowed))
> > > 		nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
> > > 	else
> > > 		break;
> > > }
> > > 
> > > and then
> > > 
> > > static inline bool node_is_toptier(int node)
> > > {
> > > 
> > > 	return !node_isset(node, promotion_mask);
> > > }
> > > 
> > 
> > This should work.  But it appears unnatural.  So, I don't think we
> > should keep adding more and more node masks to mitigate the design
> > decision that we cannot access memory tier information directly.  All
> > of this becomes simple and natural if we can access memory tier
> > information directly.
> > 
> 
> how do you derive whether a node is top tier if we have memtier
> details in pgdat?

pgdat -> memory tier -> rank

Then we can compare this rank with the fast memory rank.  The fast
memory rank can be calculated dynamically at appropriate places.
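
For example (an untested sketch; this ignores the RCU protection and
NULL checks a real implementation would need, and fast_memory_rank
stands for that dynamically calculated value):

static inline bool node_is_toptier(int node)
{
	struct memory_tier *memtier = NODE_DATA(node)->memtier;

	return memtier->rank >= fast_memory_rank;
}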

Best Regards,
Huang, Ying




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers
  2022-06-08  8:29         ` Aneesh Kumar K V
@ 2022-06-08  8:34           ` Ying Huang
  0 siblings, 0 replies; 84+ messages in thread
From: Ying Huang @ 2022-06-08  8:34 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Wed, 2022-06-08 at 13:59 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 1:53 PM, Ying Huang wrote:
> > On Wed, 2022-06-08 at 13:50 +0530, Aneesh Kumar K V wrote:
> > > On 6/8/22 12:29 PM, Ying Huang wrote:
> > > > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> > > > > This patch adds the special string "none" as a supported memtier value
> > > > > that we can use to remove a specific node from being used as a demotion target.
> > > > > 
> > > > > For ex:
> > > > > :/sys/devices/system/node/node1# cat memtier
> > > > > 1
> > > > > :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
> > > > > 1-3
> > > > > :/sys/devices/system/node/node1# echo none > memtier
> > > > > :/sys/devices/system/node/node1#
> > > > > :/sys/devices/system/node/node1# cat memtier
> > > > > :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
> > > > > 2-3
> > > > > :/sys/devices/system/node/node1#
> > > > 
> > > > Do you have a practical use case for this?  What kind of memory node
> > > > needs to be removed from memory tiers demotion/promotion?
> > > > 
> > > 
> > > This came up in our internal discussion. It was mentioned that there is
> > > a need to keep some slow memory nodes from participating in demotion.
> > 
> > Again, can you provide a practical use case?  Why shouldn't we demote
> > cold pages to these slow memory nodes?  How do we use these slow memory
> > nodes?  Are these slow memory nodes slower than disk?
> > 
> 
> This was discussed in the context of memory borrowed from a remote
> machine (aka OpenCAPI memory). In such a case, we would have a
> memory-only NUMA node which we want to avoid using for demotion.

Thanks for the information.  But why shouldn't we use them for
demotion?  Because they're too slow?  Even slower than disks?  Or some
other reason?

Best Regards,
Huang, Ying



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 0/9] mm/demotion: Memory tiers and demotion
  2022-06-03 13:42 [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (9 preceding siblings ...)
  2022-06-06  4:53 ` [PATCH] mm/demotion: Add sysfs ABI documentation Aneesh Kumar K.V
@ 2022-06-08 13:57 ` Johannes Weiner
  2022-06-08 14:20   ` Aneesh Kumar K V
  10 siblings, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2022-06-08 13:57 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

Hi Aneesh,

On Fri, Jun 03, 2022 at 07:12:28PM +0530, Aneesh Kumar K.V wrote:
> * The current tier initialization code always initializes
>   each memory-only NUMA node into a lower tier.  But a memory-only
>   NUMA node may have a high performance memory device (e.g. a DRAM
>   device attached via CXL.mem or a DRAM-backed memory-only node on
>   a virtual machine) and should be put into a higher tier.

I have to disagree with this premise. The CXL.mem bus has different
latency and bandwidth characteristics. It's also conceivable that
cheaper and slower DRAM is connected to the CXL bus (think recycling
DDR4 DIMMs after switching to DDR5). DRAM != DRAM.

Our experiments with production workloads show regressions between
15-30% in serviced requests when you don't distinguish toptier DRAM
from lower tier DRAM. While it's fixable with manual tuning, your
patches would reintroduce this regression, it seems.

Making tiers explicit is a good idea, but can we keep the current
default that CPU-less nodes are of a lower tier than ones with CPU?
I'm having a hard time imagining where this wouldn't be true... Or why
it shouldn't be those esoteric cases that need the manual tuning.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-03 13:42 ` [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
  2022-06-07 18:43   ` Tim Chen
  2022-06-07 21:32   ` Yang Shi
@ 2022-06-08 14:11   ` Johannes Weiner
  2022-06-08 14:21     ` Aneesh Kumar K V
  2022-06-08 15:55     ` Johannes Weiner
  2 siblings, 2 replies; 84+ messages in thread
From: Johannes Weiner @ 2022-06-08 14:11 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

Hi Aneesh,

On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +#ifdef CONFIG_TIERED_MEMORY
> +
> +#define MEMORY_TIER_HBM_GPU	0
> +#define MEMORY_TIER_DRAM	1
> +#define MEMORY_TIER_PMEM	2
> +
> +#define MEMORY_RANK_HBM_GPU	300
> +#define MEMORY_RANK_DRAM	200
> +#define MEMORY_RANK_PMEM	100
> +
> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> +#define MAX_MEMORY_TIERS  3

I understand the names are somewhat arbitrary, and the tier ID space
can be expanded down the line by bumping MAX_MEMORY_TIERS.

But starting out with a packed ID space can get quite awkward for
users when new tiers - especially intermediate tiers - show up in
existing configurations. I mentioned in the other email that DRAM !=
DRAM, so new tiers seem inevitable already.

It could make sense to start with a bigger address space and spread
out the list of kernel default tiers a bit within it:

MEMORY_TIER_GPU		0
MEMORY_TIER_DRAM	10
MEMORY_TIER_PMEM	20

etc.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 0/9] mm/demotion: Memory tiers and demotion
  2022-06-08 13:57 ` [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Johannes Weiner
@ 2022-06-08 14:20   ` Aneesh Kumar K V
  2022-06-09  8:53     ` Jonathan Cameron
  0 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08 14:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 7:27 PM, Johannes Weiner wrote:
> Hi Aneesh,
> 
> On Fri, Jun 03, 2022 at 07:12:28PM +0530, Aneesh Kumar K.V wrote:
>> * The current tier initialization code always initializes
>>    each memory-only NUMA node into a lower tier.  But a memory-only
>>    NUMA node may have a high performance memory device (e.g. a DRAM
>>    device attached via CXL.mem or a DRAM-backed memory-only node on
>>    a virtual machine) and should be put into a higher tier.
> 
> I have to disagree with this premise. The CXL.mem bus has different
> latency and bandwidth characteristics. It's also conceivable that
> cheaper and slower DRAM is connected to the CXL bus (think recycling
> DDR4 DIMMs after switching to DDR5). DRAM != DRAM.
> 
> Our experiments with production workloads show regressions between
> 15-30% in serviced requests when you don't distinguish toptier DRAM
> from lower tier DRAM. While it's fixable with manual tuning, your
> patches would reintroduce this regression, it seems.
> 
> Making tiers explicit is a good idea, but can we keep the current
> default that CPU-less nodes are of a lower tier than ones with CPU?
> I'm having a hard time imagining where this wouldn't be true... Or why
> it shouldn't be those esoteric cases that need the manual tuning.

This was mostly driven by virtual machine configs where we can find
memory-only NUMA nodes depending on the resource availability in the
hypervisor.

Will these CXL devices be initialized by a driver? For example, if they
are going to be initialized via dax kmem, we already consider them a
lower memory tier with this patch series.

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-08 14:11   ` Johannes Weiner
@ 2022-06-08 14:21     ` Aneesh Kumar K V
  2022-06-08 15:55     ` Johannes Weiner
  1 sibling, 0 replies; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08 14:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 7:41 PM, Johannes Weiner wrote:
> Hi Aneesh,
> 
> On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
>> @@ -0,0 +1,20 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_MEMORY_TIERS_H
>> +#define _LINUX_MEMORY_TIERS_H
>> +
>> +#ifdef CONFIG_TIERED_MEMORY
>> +
>> +#define MEMORY_TIER_HBM_GPU	0
>> +#define MEMORY_TIER_DRAM	1
>> +#define MEMORY_TIER_PMEM	2
>> +
>> +#define MEMORY_RANK_HBM_GPU	300
>> +#define MEMORY_RANK_DRAM	200
>> +#define MEMORY_RANK_PMEM	100
>> +
>> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
>> +#define MAX_MEMORY_TIERS  3
> 
> I understand the names are somewhat arbitrary, and the tier ID space
> can be expanded down the line by bumping MAX_MEMORY_TIERS.
> 
> But starting out with a packed ID space can get quite awkward for
> users when new tiers - especially intermediate tiers - show up in
> existing configurations. I mentioned in the other email that DRAM !=
> DRAM, so new tiers seem inevitable already.
> 
> It could make sense to start with a bigger address space and spread
> out the list of kernel default tiers a bit within it:
> 
> MEMORY_TIER_GPU		0
> MEMORY_TIER_DRAM	10
> MEMORY_TIER_PMEM	20
> 

The tier index / tier dev id doesn't have any special meaning. What is
used to find the demotion order is the memory tier rank, and the ranks
are already really spread out (300, 200, 100).
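
i.e. the demotion walk only needs the relative rank ordering. The core
of it (illustrative only; see establish_migration_targets() in the
patch) is just:

	/*
	 * The tier list is kept sorted by rank, so the next entry is
	 * the next lower tier, whatever its device id is.
	 */
	if (list_is_last(&memtier->list, &memory_tiers))
		return;		/* terminal tier, nothing to demote to */
	memtier = list_next_entry(memtier, list);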

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-06-08  8:32               ` Ying Huang
@ 2022-06-08 14:37                 ` Aneesh Kumar K.V
  2022-06-08 20:14                   ` Tim Chen
  2022-06-10  6:04                   ` Ying Huang
  0 siblings, 2 replies; 84+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-08 14:37 UTC (permalink / raw)
  To: Ying Huang, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

Ying Huang <ying.huang@intel.com> writes:

....

> > > 
>> > > is this good (not tested)?
>> > > /*
>> > >    * Build the allowed promotion mask. Promotion is allowed from a
>> > >    * memory tier only if that tier sits below the lowest tier that
>> > >    * includes compute. We want to skip promotion from a memory tier
>> > >    * if any node which is part of that memory tier has CPUs. Once we
>> > >    * detect such a memory tier, we consider that tier as top tier from
>> > >    * which promotion is not allowed.
>> > >    */
>> > > list_for_each_entry_reverse(memtier, &memory_tiers, list) {
>> > > 	nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
>> > > 	if (nodes_empty(allowed))
>> > > 		nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
>> > > 	else
>> > > 		break;
>> > > }
>> > > 
>> > > and then
>> > > 
>> > > static inline bool node_is_toptier(int node)
>> > > {
>> > > 
>> > > 	return !node_isset(node, promotion_mask);
>> > > }
>> > > 
>> > 
>> > This should work.  But it appears unnatural.  So, I don't think we
>> > should keep adding more and more node masks to mitigate the design
>> > decision that we cannot access memory tier information directly.  All
>> > of this becomes simple and natural if we can access memory tier
>> > information directly.
>> > 
>> 
>> how do you derive whether a node is top tier if we have memtier
>> details in pgdat?
>
> pgdat -> memory tier -> rank
>
> Then we can compare this rank with the fast memory rank.  The fast
> memory rank can be calculated dynamically at appropriate places.

This is what I am testing now. We still need to closely audit the
lock-free access to NODE_DATA()->memtier. For v6 I will keep this as a
separate patch, and once we all agree that it is safe, I will fold it
back.

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index a388a806b61a..3e733de1a8a0 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -17,7 +17,6 @@
 #define MAX_MEMORY_TIERS  (MAX_STATIC_MEMORY_TIERS + 2)
 
 extern bool numa_demotion_enabled;
-extern nodemask_t promotion_mask;
 int node_create_and_set_memory_tier(int node, int tier);
 int next_demotion_node(int node);
 int node_set_memory_tier(int node, int tier);
@@ -25,15 +24,7 @@ int node_get_memory_tier_id(int node);
 int node_reset_memory_tier(int node, int tier);
 void node_remove_from_memory_tier(int node);
 void node_get_allowed_targets(int node, nodemask_t *targets);
-
-/*
- * By default all nodes are top tier. As we create new memory tiers
- * below the top tier, we add their nodes to the NON_TOP_TIER state.
- */
-static inline bool node_is_toptier(int node)
-{
-	return !node_isset(node, promotion_mask);
-}
+bool node_is_toptier(int node);
 
 #else
 #define numa_demotion_enabled	false
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aab70355d64f..c4fcfd2b9980 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -928,6 +928,9 @@ typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
+#ifdef CONFIG_TIERED_MEMORY
+	struct memory_tier *memtier;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 29a038bb38b0..31ef0fab5f19 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -7,6 +7,7 @@
 #include <linux/random.h>
 #include <linux/memory.h>
 #include <linux/idr.h>
+#include <linux/rcupdate.h>
 
 #include "internal.h"
 
@@ -26,7 +27,7 @@ struct demotion_nodes {
 static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
-nodemask_t promotion_mask;
+static int top_tier_rank;
 /*
  * node_demotion[] examples:
  *
@@ -135,7 +136,7 @@ static void memory_tier_device_release(struct device *dev)
 	if (tier->dev.id >= MAX_STATIC_MEMORY_TIERS)
 		ida_free(&memtier_dev_id, tier->dev.id);
 
-	kfree(tier);
+	kfree_rcu(tier);
 }
 
 /*
@@ -233,6 +234,70 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
 	return NULL;
 }
 
+/*
+ * Called with memory_tier_lock held. Hence the device references cannot
+ * be dropped during this function.
+ */
+static void memtier_node_clear(int node, struct memory_tier *memtier)
+{
+	pg_data_t *pgdat;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return;
+
+	rcu_assign_pointer(pgdat->memtier, NULL);
+	/*
+	 * Make sure the read side sees the NULL value before we clear the node
+	 * from the nodelist.
+	 */
+	synchronize_rcu();
+	node_clear(node, memtier->nodelist);
+}
+
+static void memtier_node_set(int node, struct memory_tier *memtier)
+{
+	pg_data_t *pgdat;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return;
+	/*
+	 * Make sure we mark the memtier NULL before we assign the new memory tier
+	 * to the NUMA node. This makes sure that anybody looking at NODE_DATA
+	 * finds either a NULL memtier or one which is still valid.
+	 */
+	rcu_assign_pointer(pgdat->memtier, NULL);
+	synchronize_rcu();
+	node_set(node, memtier->nodelist);
+	rcu_assign_pointer(pgdat->memtier, memtier);
+}
+
+bool node_is_toptier(int node)
+{
+	bool toptier;
+	pg_data_t *pgdat;
+	struct memory_tier *memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return false;
+
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (!memtier) {
+		toptier = true;
+		goto out;
+	}
+	if (memtier->rank >= top_tier_rank)
+		toptier = true;
+	else
+		toptier = false;
+out:
+	rcu_read_unlock();
+	return toptier;
+}
+
 static int __node_create_and_set_memory_tier(int node, int tier)
 {
 	int ret = 0;
@@ -253,7 +318,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
 			goto out;
 		}
 	}
-	node_set(node, memtier->nodelist);
+	memtier_node_set(node, memtier);
 out:
 	return ret;
 }
@@ -275,12 +340,12 @@ int node_create_and_set_memory_tier(int node, int tier)
 	if (current_tier->dev.id == tier)
 		goto out;
 
-	node_clear(node, current_tier->nodelist);
+	memtier_node_clear(node, current_tier);
 
 	ret = __node_create_and_set_memory_tier(node, tier);
 	if (ret) {
 		/* reset it back to older tier */
-		node_set(node, current_tier->nodelist);
+		memtier_node_set(node, current_tier);
 		goto out;
 	}
 
@@ -305,7 +370,7 @@ static int __node_set_memory_tier(int node, int tier)
 		ret = -EINVAL;
 		goto out;
 	}
-	node_set(node, memtier->nodelist);
+	memtier_node_set(node, memtier);
 out:
 	return ret;
 }
@@ -374,12 +439,12 @@ int node_reset_memory_tier(int node, int tier)
 	if (current_tier->dev.id == tier)
 		goto out;
 
-	node_clear(node, current_tier->nodelist);
+	memtier_node_clear(node, current_tier);
 
 	ret = __node_set_memory_tier(node, tier);
 	if (ret) {
 		/* reset it back to older tier */
-		node_set(node, current_tier->nodelist);
+		memtier_node_set(node, current_tier);
 		goto out;
 	}
 
@@ -407,7 +472,7 @@ void node_remove_from_memory_tier(int node)
 	 * empty then unregister it to make it invisible
 	 * in sysfs.
 	 */
-	node_clear(node, memtier->nodelist);
+	memtier_node_clear(node, memtier);
 	if (nodes_empty(memtier->nodelist))
 		unregister_memory_tier(memtier);
 
@@ -570,15 +635,13 @@ static void establish_migration_targets(void)
 	 * a memory tier, we consider that tier as the top tier from
 	 * which promotion is not allowed.
 	 */
-	promotion_mask = NODE_MASK_NONE;
 	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
 		nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
-		if (nodes_empty(allowed))
-			nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
-		else
+		if (!nodes_empty(allowed)) {
+			top_tier_rank = memtier->rank;
 			break;
+		}
 	}
-
 	pr_emerg("top tier rank is %d\n", top_tier_rank);
 	allowed = NODE_MASK_NONE;
 	/*
@@ -748,7 +811,7 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
 
 static int __init memory_tier_init(void)
 {
-	int ret;
+	int ret, node;
 	struct memory_tier *memtier;
 
 	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
@@ -766,7 +829,13 @@ static int __init memory_tier_init(void)
 		panic("%s() failed to register memory tier: %d\n", __func__, ret);
 
 	/* CPU only nodes are not part of memory tiers. */
-	memtier->nodelist = node_states[N_MEMORY];
+	for_each_node_state(node, N_MEMORY) {
+		/*
+		 * Should be safe to do this early in the boot.
+		 */
+		NODE_DATA(node)->memtier = memtier;
+		node_set(node, memtier->nodelist);
+	}
 	migrate_on_reclaim_init();
 
 	return 0;

^ permalink raw reply related	[flat|nested] 84+ messages in thread
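The RCU choreography in the hunk above is subtle: memtier_node_set() publishes
NULL, waits for a grace period, and only then publishes the new pointer, so a
lock-free reader can never observe a half-updated tier. A minimal sketch of a
hypothetical consumer built on that guarantee, using only kernel APIs that
already appear in the patch (the helper name node_rank_or_default() is
illustrative, not part of the series):

static int node_rank_or_default(int node, int default_rank)
{
	pg_data_t *pgdat = NODE_DATA(node);
	struct memory_tier *memtier;
	int rank = default_rank;

	if (!pgdat)
		return default_rank;

	rcu_read_lock();
	memtier = rcu_dereference(pgdat->memtier);
	if (memtier)
		rank = memtier->rank;	/* stable within the read section */
	rcu_read_unlock();		/* do not cache memtier past this */

	return rank;
}

Any such reader must tolerate the NULL window between clear and re-publish,
exactly as node_is_toptier() above does by treating a NULL memtier as top tier.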

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-08 14:11   ` Johannes Weiner
  2022-06-08 14:21     ` Aneesh Kumar K V
@ 2022-06-08 15:55     ` Johannes Weiner
  2022-06-08 16:13       ` Aneesh Kumar K V
  1 sibling, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2022-06-08 15:55 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

Hello,

On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:
> On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
> > @@ -0,0 +1,20 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_MEMORY_TIERS_H
> > +#define _LINUX_MEMORY_TIERS_H
> > +
> > +#ifdef CONFIG_TIERED_MEMORY
> > +
> > +#define MEMORY_TIER_HBM_GPU	0
> > +#define MEMORY_TIER_DRAM	1
> > +#define MEMORY_TIER_PMEM	2
> > +
> > +#define MEMORY_RANK_HBM_GPU	300
> > +#define MEMORY_RANK_DRAM	200
> > +#define MEMORY_RANK_PMEM	100
> > +
> > +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> > +#define MAX_MEMORY_TIERS  3
> 
> I understand the names are somewhat arbitrary, and the tier ID space
> can be expanded down the line by bumping MAX_MEMORY_TIERS.
> 
> But starting out with a packed ID space can get quite awkward for
> users when new tiers - especially intermediate tiers - show up in
> existing configurations. I mentioned in the other email that DRAM !=
> DRAM, so new tiers seem inevitable already.
> 
> It could make sense to start with a bigger address space and spread
> out the list of kernel default tiers a bit within it:
> 
> MEMORY_TIER_GPU		0
> MEMORY_TIER_DRAM	10
> MEMORY_TIER_PMEM	20

Forgive me if I'm asking a question that has been answered. I went
back to earlier threads and couldn't work it out - maybe there were
some off-list discussions? Anyway...

Why is there a distinction between tier ID and rank? I understand that
rank was added because tier IDs were too few. But if rank determines
ordering, what is the use of a separate tier ID? IOW, why not make the
tier ID space wider and have the kernel pick a few spread out defaults
based on known hardware, with plenty of headroom to be future proof.

  $ ls tiers
  100				# DEFAULT_TIER
  $ cat tiers/100/nodelist
  0-1				# conventional numa nodes

  <pmem is onlined>

  $ grep . tiers/*/nodelist
  tiers/100/nodelist:0-1	# conventional numa
  tiers/200/nodelist:2		# pmem

  $ grep . nodes/*/tier
  nodes/0/tier:100
  nodes/1/tier:100
  nodes/2/tier:200

  <unknown device is online as node 3, defaults to 100>

  $ grep . tiers/*/nodelist
  tiers/100/nodelist:0-1,3
  tiers/200/nodelist:2

  $ echo 300 >nodes/3/tier
  $ grep . tiers/*/nodelist
  tiers/100/nodelist:0-1
  tiers/200/nodelist:2
  tiers/300/nodelist:3

  $ echo 200 >nodes/3/tier
  $ grep . tiers/*/nodelist
  tiers/100/nodelist:0-1	
  tiers/200/nodelist:2-3

etc.

^ permalink raw reply	[flat|nested] 84+ messages in thread
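To make the single-ID proposal concrete: with a sparse ID space the ID itself
becomes the ordering key, and no separate rank is needed. A sketch under that
assumption (constant names and values are illustrative, not from the series;
smaller ID = faster tier, matching the GPU=0 < DRAM=10 < PMEM=20 example
quoted above):

#define MEMORY_TIER_HBM_GPU	 50
#define MEMORY_TIER_DRAM	100	/* DEFAULT_TIER */
#define MEMORY_TIER_PMEM	200

static inline bool tier_is_faster(int a, int b)
{
	return a < b;	/* ordering falls out of the ID directly */
}

The trade-off, raised in the reply below, is that a sysfs directory named
after such an ID cannot keep its name if its position in the ordering has to
change later.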

* Re: [PATCH v5 2/9] mm/demotion: Expose per node memory tier to sysfs
  2022-06-08  4:55     ` Aneesh Kumar K V
  2022-06-08  6:42       ` Ying Huang
@ 2022-06-08 16:06       ` Tim Chen
  2022-06-08 16:15         ` Aneesh Kumar K V
  1 sibling, 1 reply; 84+ messages in thread
From: Tim Chen @ 2022-06-08 16:06 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Wed, 2022-06-08 at 10:25 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 1:45 AM, Tim Chen wrote:
> > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> > >   
> > > +static struct memory_tier *__node_get_memory_tier(int node)
> > > +{
> > > +	struct memory_tier *memtier;
> > > +
> > > +	list_for_each_entry(memtier, &memory_tiers, list) {
> > 
> > We may need to map node to mem_tier quite often, if we need
> > to account memory usage at the tier level.  It will be more efficient
> > to have a pointer from node (pgdat) to memtier rather
> > than doing a search through the list.
> > 
> > 
> 
> That is something I was actively trying to avoid. Currently all struct 
> memory_tier references are taken with the memory_tier_lock mutex held. That 
> simplifies the locking and reference counting.
> 
> As of now we are able to implement all the required interfaces without 
> pgdat having pointers to struct memory_tier. We can update pgdat with 
> memtier details when we are implementing changes requiring those. We 
> could keep an additional memtier->dev reference to make sure memory tiers 
> are not destroyed while other parts of the kernel are referencing the 
> same. But IMHO such changes should wait till we have users for the same.
> 

I think we should have an efficient mapping from node to memtier from
the get go.  There are many easily envisioned scenarios where
we need to map from node to memtier, which Ying pointed out.

Tim


^ permalink raw reply	[flat|nested] 84+ messages in thread
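The cost difference Tim is pointing at, sketched with the struct names from
the series (the pgdat->memtier field exists only in the follow-up patch
discussed elsewhere in this thread, so the second helper is an assumption):

/* O(#tiers): walk the global list, as the series does today. */
static struct memory_tier *lookup_by_list(int node)
{
	struct memory_tier *memtier;

	list_for_each_entry(memtier, &memory_tiers, list)
		if (node_isset(node, memtier->nodelist))
			return memtier;
	return NULL;
}

/* O(1): follow a cached per-node pointer instead. */
static struct memory_tier *lookup_by_pgdat(int node)
{
	pg_data_t *pgdat = NODE_DATA(node);

	return pgdat ? pgdat->memtier : NULL;	/* needs RCU on hot paths */
}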

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-08 15:55     ` Johannes Weiner
@ 2022-06-08 16:13       ` Aneesh Kumar K V
  2022-06-08 18:16         ` Johannes Weiner
  0 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08 16:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 9:25 PM, Johannes Weiner wrote:
> Hello,
> 
> On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:
>> On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
>>> @@ -0,0 +1,20 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +#ifndef _LINUX_MEMORY_TIERS_H
>>> +#define _LINUX_MEMORY_TIERS_H
>>> +
>>> +#ifdef CONFIG_TIERED_MEMORY
>>> +
>>> +#define MEMORY_TIER_HBM_GPU	0
>>> +#define MEMORY_TIER_DRAM	1
>>> +#define MEMORY_TIER_PMEM	2
>>> +
>>> +#define MEMORY_RANK_HBM_GPU	300
>>> +#define MEMORY_RANK_DRAM	200
>>> +#define MEMORY_RANK_PMEM	100
>>> +
>>> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
>>> +#define MAX_MEMORY_TIERS  3
>>
>> I understand the names are somewhat arbitrary, and the tier ID space
>> can be expanded down the line by bumping MAX_MEMORY_TIERS.
>>
>> But starting out with a packed ID space can get quite awkward for
>> users when new tiers - especially intermediate tiers - show up in
>> existing configurations. I mentioned in the other email that DRAM !=
>> DRAM, so new tiers seem inevitable already.
>>
>> It could make sense to start with a bigger address space and spread
>> out the list of kernel default tiers a bit within it:
>>
>> MEMORY_TIER_GPU		0
>> MEMORY_TIER_DRAM	10
>> MEMORY_TIER_PMEM	20
> 
> Forgive me if I'm asking a question that has been answered. I went
> back to earlier threads and couldn't work it out - maybe there were
> some off-list discussions? Anyway...
> 
> Why is there a distinction between tier ID and rank? I understand that
> rank was added because tier IDs were too few. But if rank determines
> ordering, what is the use of a separate tier ID? IOW, why not make the
> tier ID space wider and have the kernel pick a few spread out defaults
> based on known hardware, with plenty of headroom to be future proof.
> 
>    $ ls tiers
>    100				# DEFAULT_TIER
>    $ cat tiers/100/nodelist
>    0-1				# conventional numa nodes
> 
>    <pmem is onlined>
> 
>    $ grep . tiers/*/nodelist
>    tiers/100/nodelist:0-1	# conventional numa
>    tiers/200/nodelist:2		# pmem
> 
>    $ grep . nodes/*/tier
>    nodes/0/tier:100
>    nodes/1/tier:100
>    nodes/2/tier:200
> 
>    <unknown device is online as node 3, defaults to 100>
> 
>    $ grep . tiers/*/nodelist
>    tiers/100/nodelist:0-1,3
>    tiers/200/nodelist:2
> 
>    $ echo 300 >nodes/3/tier
>    $ grep . tiers/*/nodelist
>    tiers/100/nodelist:0-1
>    tiers/200/nodelist:2
>    tiers/300/nodelist:3
> 
>    $ echo 200 >nodes/3/tier
>    $ grep . tiers/*/nodelist
>    tiers/100/nodelist:0-1	
>    tiers/200/nodelist:2-3
> 
> etc.

tier ID is also used as the device id, memtier.dev.id. It was discussed 
that we would need the ability to change the rank value of a memory tier. 
If we make the rank value the same as the tier ID or tier device id, we 
will not be able to support that.

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread
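The constraint being described: device_register() derives the sysfs directory
name (/sys/devices/system/memtier/memtierN) from dev.id, so the id must stay
stable for userspace, while rank is an ordinary mutable field. A hypothetical
sketch of a rank update under that split (this helper does not exist in the
series; insert_memory_tier() and establish_migration_targets() do):

static void memtier_set_rank(struct memory_tier *memtier, int new_rank)
{
	mutex_lock(&memory_tier_lock);
	memtier->rank = new_rank;	/* the ordering key changes...      */
	list_del(&memtier->list);	/* ...so re-sort the tier list...   */
	insert_memory_tier(memtier);	/* (kept sorted by descending rank) */
	establish_migration_targets();	/* ...and rebuild demotion paths    */
	mutex_unlock(&memory_tier_lock);
}

The sysfs name memtierN never changes; only the tier's position in the
demotion order does.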

* Re: [PATCH v5 2/9] mm/demotion: Expose per node memory tier to sysfs
  2022-06-08 16:06       ` Tim Chen
@ 2022-06-08 16:15         ` Aneesh Kumar K V
  0 siblings, 0 replies; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08 16:15 UTC (permalink / raw)
  To: Tim Chen, linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 9:36 PM, Tim Chen wrote:
> On Wed, 2022-06-08 at 10:25 +0530, Aneesh Kumar K V wrote:
>> On 6/8/22 1:45 AM, Tim Chen wrote:
>>> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>>>>    
>>>> +static struct memory_tier *__node_get_memory_tier(int node)
>>>> +{
>>>> +	struct memory_tier *memtier;
>>>> +
>>>> +	list_for_each_entry(memtier, &memory_tiers, list) {
>>>
>>> We may need to map node to mem_tier quite often, if we need
>>> to account memory usage at the tier level.  It will be more efficient
>>> to have a pointer from node (pgdat) to memtier rather
>>> than doing a search through the list.
>>>
>>>
>>
>> That is something I was actively trying to avoid. Currently all struct
>> memory_tier references are taken with the memory_tier_lock mutex held. That
>> simplifies the locking and reference counting.
>>
>> As of now we are able to implement all the required interfaces without
>> pgdat having pointers to struct memory_tier. We can update pgdat with
>> memtier details when we are implementing changes requiring those. We
>> could keep an additional memtier->dev reference to make sure memory tiers
>> are not destroyed while other parts of the kernel are referencing the
>> same. But IMHO such changes should wait till we have users for the same.
>>
> 
> I think we should have an efficient mapping from node to memtier from
> the get go.  There are many easily envisioned scenarios where
> we need to map from node to memtier, which Ying pointed out.
> 

I did an initial implementation here. We need to make sure we can access 
NODE_DATA()->memtier lockless. Can you review the changes here

https://lore.kernel.org/linux-mm/87sfoffcfz.fsf@linux.ibm.com

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-08  1:34     ` Ying Huang
@ 2022-06-08 16:37       ` Yang Shi
  2022-06-09  6:52         ` Ying Huang
  0 siblings, 1 reply; 84+ messages in thread
From: Yang Shi @ 2022-06-08 16:37 UTC (permalink / raw)
  To: Ying Huang
  Cc: Aneesh Kumar K.V, Linux MM, Andrew Morton, Wei Xu, Greg Thelen,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Tue, Jun 7, 2022 at 6:34 PM Ying Huang <ying.huang@intel.com> wrote:
>
> On Tue, 2022-06-07 at 14:32 -0700, Yang Shi wrote:
> > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> > >
> > > In the current kernel, memory tiers are defined implicitly via a
> > > demotion path relationship between NUMA nodes, which is created
> > > during the kernel initialization and updated when a NUMA node is
> > > hot-added or hot-removed.  The current implementation puts all
> > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based
> > > on the distances between nodes.
> > >
> > > This current memory tier kernel interface needs to be improved for
> > > several important use cases:
> > >
> > > The current tier initialization code always initializes
> > > each memory-only NUMA node into a lower tier.  But a memory-only
> > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > a virtual machine) and should be put into a higher tier.
> > >
> > > The current tier hierarchy always puts CPU nodes into the top
> > > tier. But on a system with HBM or GPU devices, the
> > > memory-only NUMA nodes mapping these devices should be in the
> > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > next lower tier.
> > >
> > > With the current kernel, a higher tier node can only be demoted to selected nodes on the
> > > next lower tier as defined by the demotion path, not any other
> > > node from any lower tier.  This strict, hard-coded demotion order
> > > does not work in all use cases (e.g. some use cases may want to
> > > allow cross-socket demotion to another node in the same demotion
> > > tier as a fallback when the preferred demotion node is out of
> > > space), This demotion order is also inconsistent with the page
> > > allocation fallback order when all the nodes in a higher tier are
> > > out of space: The page allocation can fall back to any node from
> > > any lower tier, whereas the demotion order doesn't allow that.
> > >
> > > The current kernel also doesn't provide any interfaces for the
> > > userspace to learn about the memory tier hierarchy in order to
> > > optimize its memory allocations.
> > >
> > > This patch series addresses the above by defining memory tiers explicitly.
> > >
> > > This patch introduces explicit memory tiers with ranks. The rank
> > > value of a memory tier is used to derive the demotion order between
> > > NUMA nodes. The memory tiers present in a system can be found at
> > >
> > > /sys/devices/system/memtier/memtierN/
> > >
> > > The nodes which are part of a specific memory tier can be listed
> > > via
> > > /sys/devices/system/memtier/memtierN/nodelist
> > >
> > > "Rank" is an opaque value. Its absolute value doesn't have any
> > > special meaning. But the rank values of different memtiers can be
> > > compared with each other to determine the memory tier order.
> > >
> > > For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
> > > their rank values are 300, 100, 200, then the memory tier order is:
> > > memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
> > > and memtier1 is the lowest tier.
> > >
> > > The rank value of each memtier should be unique.
> > >
> > > A higher rank memory tier will appear first in the demotion order
> > > than a lower rank memory tier. ie. while reclaim we choose a node
> > > in higher rank memory tier to demote pages to as compared to a node
> > > in a lower rank memory tier.
> > >
> > > For now we are not adding support for a dynamic number of memory
> > > tiers. But a future series supporting that is possible. Currently the
> > > number of tiers supported is limited to MAX_MEMORY_TIERS (3).
> > > When doing memory hotplug, if a node is not added to a memory tier,
> > > the NUMA node gets added to DEFAULT_MEMORY_TIER (1).
> > >
> > > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
> > >
> > > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> > >
> > > Suggested-by: Wei Xu <weixugc@google.com>
> > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > ---
> > >  include/linux/memory-tiers.h |  20 ++++
> > >  mm/Kconfig                   |  11 ++
> > >  mm/Makefile                  |   1 +
> > >  mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
> > >  4 files changed, 220 insertions(+)
> > >  create mode 100644 include/linux/memory-tiers.h
> > >  create mode 100644 mm/memory-tiers.c
> > >
> > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > > new file mode 100644
> > > index 000000000000..e17f6b4ee177
> > > --- /dev/null
> > > +++ b/include/linux/memory-tiers.h
> > > @@ -0,0 +1,20 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > +#define _LINUX_MEMORY_TIERS_H
> > > +
> > > +#ifdef CONFIG_TIERED_MEMORY
> > > +
> > > +#define MEMORY_TIER_HBM_GPU    0
> > > +#define MEMORY_TIER_DRAM       1
> > > +#define MEMORY_TIER_PMEM       2
> > > +
> > > +#define MEMORY_RANK_HBM_GPU    300
> > > +#define MEMORY_RANK_DRAM       200
> > > +#define MEMORY_RANK_PMEM       100
> > > +
> > > +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
> > > +#define MAX_MEMORY_TIERS  3
> > > +
> > > +#endif /* CONFIG_TIERED_MEMORY */
> > > +
> > > +#endif
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 169e64192e48..08a3d330740b 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> > >  config ARCH_ENABLE_THP_MIGRATION
> > >         bool
> > >
> > > +config TIERED_MEMORY
> > > +       bool "Support for explicit memory tiers"
> > > +       def_bool n
> > > +       depends on MIGRATION && NUMA
> > > +       help
> > > +         Support to split nodes into memory tiers explicitly and
> > > +         to demote pages on reclaim to lower tiers. This option
> > > +         also exposes sysfs interface to read nodes available in
> > > +         specific tier and to move specific node among different
> > > +         possible tiers.
> >
> > IMHO we should not need a new kernel config. If tiering is not present
> > then there is just one tier on the system. And tiering is a kind of
> > hardware configuration; the information could be shown regardless of
> > whether demotion/promotion is supported/enabled or not.
>
> I think so too.  At least it appears unnecessary to let the user turn
> it on/off at configuration time.
>
> All the code should be enclosed by #if defined(CONFIG_NUMA) &&
> defined(CONFIG_MIGRATION).  So we will not waste memory in small
> systems.

CONFIG_NUMA alone should be good enough. CONFIG_MIGRATION is enabled
by default if NUMA is enabled. And MIGRATION is just used to support
demotion/promotion. Memory tiers exist even if demotion/promotion
is not supported, right?

>
> Best Regards,
> Huang, Ying
>
> > > +
> > >  config HUGETLB_PAGE_SIZE_VARIABLE
> > >         def_bool n
> > >         help
> > > diff --git a/mm/Makefile b/mm/Makefile
> > > index 6f9ffa968a1a..482557fbc9d1 100644
> > > --- a/mm/Makefile
> > > +++ b/mm/Makefile
> > > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
> > >  obj-$(CONFIG_FAILSLAB) += failslab.o
> > >  obj-$(CONFIG_MEMTEST)          += memtest.o
> > >  obj-$(CONFIG_MIGRATION) += migrate.o
> > > +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
> > >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> > >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > > new file mode 100644
> > > index 000000000000..7de18d94a08d
> > > --- /dev/null
> > > +++ b/mm/memory-tiers.c
> > > @@ -0,0 +1,188 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include <linux/types.h>
> > > +#include <linux/device.h>
> > > +#include <linux/nodemask.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/memory-tiers.h>
> > > +
> > > +struct memory_tier {
> > > +       struct list_head list;
> > > +       struct device dev;
> > > +       nodemask_t nodelist;
> > > +       int rank;
> > > +};
> > > +
> > > +#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
> > > +
> > > +static struct bus_type memory_tier_subsys = {
> > > +       .name = "memtier",
> > > +       .dev_name = "memtier",
> > > +};
> > > +
> > > +static DEFINE_MUTEX(memory_tier_lock);
> > > +static LIST_HEAD(memory_tiers);
> > > +
> > > +
> > > +static ssize_t nodelist_show(struct device *dev,
> > > +                            struct device_attribute *attr, char *buf)
> > > +{
> > > +       struct memory_tier *memtier = to_memory_tier(dev);
> > > +
> > > +       return sysfs_emit(buf, "%*pbl\n",
> > > +                         nodemask_pr_args(&memtier->nodelist));
> > > +}
> > > +static DEVICE_ATTR_RO(nodelist);
> > > +
> > > +static ssize_t rank_show(struct device *dev,
> > > +                        struct device_attribute *attr, char *buf)
> > > +{
> > > +       struct memory_tier *memtier = to_memory_tier(dev);
> > > +
> > > +       return sysfs_emit(buf, "%d\n", memtier->rank);
> > > +}
> > > +static DEVICE_ATTR_RO(rank);
> > > +
> > > +static struct attribute *memory_tier_dev_attrs[] = {
> > > +       &dev_attr_nodelist.attr,
> > > +       &dev_attr_rank.attr,
> > > +       NULL
> > > +};
> > > +
> > > +static const struct attribute_group memory_tier_dev_group = {
> > > +       .attrs = memory_tier_dev_attrs,
> > > +};
> > > +
> > > +static const struct attribute_group *memory_tier_dev_groups[] = {
> > > +       &memory_tier_dev_group,
> > > +       NULL
> > > +};
> > > +
> > > +static void memory_tier_device_release(struct device *dev)
> > > +{
> > > +       struct memory_tier *tier = to_memory_tier(dev);
> > > +
> > > +       kfree(tier);
> > > +}
> > > +
> > > +/*
> > > + * Keep it simple by having a direct mapping between
> > > + * tier index and rank value.
> > > + */
> > > +static inline int get_rank_from_tier(unsigned int tier)
> > > +{
> > > +       switch (tier) {
> > > +       case MEMORY_TIER_HBM_GPU:
> > > +               return MEMORY_RANK_HBM_GPU;
> > > +       case MEMORY_TIER_DRAM:
> > > +               return MEMORY_RANK_DRAM;
> > > +       case MEMORY_TIER_PMEM:
> > > +               return MEMORY_RANK_PMEM;
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static void insert_memory_tier(struct memory_tier *memtier)
> > > +{
> > > +       struct list_head *ent;
> > > +       struct memory_tier *tmp_memtier;
> > > +
> > > +       list_for_each(ent, &memory_tiers) {
> > > +               tmp_memtier = list_entry(ent, struct memory_tier, list);
> > > +               if (tmp_memtier->rank < memtier->rank) {
> > > +                       list_add_tail(&memtier->list, ent);
> > > +                       return;
> > > +               }
> > > +       }
> > > +       list_add_tail(&memtier->list, &memory_tiers);
> > > +}
> > > +
> > > +static struct memory_tier *register_memory_tier(unsigned int tier)
> > > +{
> > > +       int error;
> > > +       struct memory_tier *memtier;
> > > +
> > > +       if (tier >= MAX_MEMORY_TIERS)
> > > +               return NULL;
> > > +
> > > +       memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> > > +       if (!memtier)
> > > +               return NULL;
> > > +
> > > +       memtier->dev.id = tier;
> > > +       memtier->rank = get_rank_from_tier(tier);
> > > +       memtier->dev.bus = &memory_tier_subsys;
> > > +       memtier->dev.release = memory_tier_device_release;
> > > +       memtier->dev.groups = memory_tier_dev_groups;
> > > +
> > > +       insert_memory_tier(memtier);
> > > +
> > > +       error = device_register(&memtier->dev);
> > > +       if (error) {
> > > +               list_del(&memtier->list);
> > > +               put_device(&memtier->dev);
> > > +               return NULL;
> > > +       }
> > > +       return memtier;
> > > +}
> > > +
> > > +__maybe_unused // temporary, to prevent warnings during bisects
> > > +static void unregister_memory_tier(struct memory_tier *memtier)
> > > +{
> > > +       list_del(&memtier->list);
> > > +       device_unregister(&memtier->dev);
> > > +}
> > > +
> > > +static ssize_t
> > > +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> > > +{
> > > +       return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
> > > +}
> > > +static DEVICE_ATTR_RO(max_tier);
> > > +
> > > +static ssize_t
> > > +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> > > +{
> > > +       return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER);
> > > +}
> > > +static DEVICE_ATTR_RO(default_tier);
> > > +
> > > +static struct attribute *memory_tier_attrs[] = {
> > > +       &dev_attr_max_tier.attr,
> > > +       &dev_attr_default_tier.attr,
> > > +       NULL
> > > +};
> > > +
> > > +static const struct attribute_group memory_tier_attr_group = {
> > > +       .attrs = memory_tier_attrs,
> > > +};
> > > +
> > > +static const struct attribute_group *memory_tier_attr_groups[] = {
> > > +       &memory_tier_attr_group,
> > > +       NULL,
> > > +};
> > > +
> > > +static int __init memory_tier_init(void)
> > > +{
> > > +       int ret;
> > > +       struct memory_tier *memtier;
> > > +
> > > +       ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> > > +       if (ret)
> > > +               panic("%s() failed to register subsystem: %d\n", __func__, ret);
> > > +
> > > +       /*
> > > +        * Register only the default memory tier to hide all empty
> > > +        * memory tiers from sysfs.
> > > +        */
> > > +       memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
> > > +       if (!memtier)
> > > +               panic("%s() failed to register memory tier: %d\n", __func__, ret);
> > > +
> > > +       /* CPU only nodes are not part of memory tiers. */
> > > +       memtier->nodelist = node_states[N_MEMORY];
> > > +
> > > +       return 0;
> > > +}
> > > +subsys_initcall(memory_tier_init);
> > > +
> > > --
> > > 2.36.1
> > >
>
>

^ permalink raw reply	[flat|nested] 84+ messages in thread
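A note on the rank semantics in the quoted cover letter: insert_memory_tier()
keeps the memory_tiers list sorted by descending rank, so the demotion order
falls straight out of list position. A minimal sketch of that property (the
helper is illustrative; the series derives targets inside
establish_migration_targets()):

/* Next tier in the demotion order, i.e. the next-lower rank. */
static struct memory_tier *next_lower_tier(struct memory_tier *memtier)
{
	if (list_is_last(&memtier->list, &memory_tiers))
		return NULL;	/* already the lowest tier */
	return list_next_entry(memtier, list);
}

With ranks 300/100/200 on memtier0/memtier1/memtier2, walking the sorted list
gives memtier0 -> memtier2 -> memtier1, matching the cover letter's example.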

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-08  4:58     ` Aneesh Kumar K V
  2022-06-08  6:18       ` Ying Huang
@ 2022-06-08 16:42       ` Yang Shi
  2022-06-09  8:17         ` Aneesh Kumar K V
  1 sibling, 1 reply; 84+ messages in thread
From: Yang Shi @ 2022-06-08 16:42 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Linux MM, Andrew Morton, Wei Xu, Huang Ying, Greg Thelen,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Tue, Jun 7, 2022 at 9:58 PM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 6/8/22 3:02 AM, Yang Shi wrote:
> > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> In the current kernel, memory tiers are defined implicitly via a
> >> demotion path relationship between NUMA nodes, which is created
> >> during the kernel initialization and updated when a NUMA node is
> >> hot-added or hot-removed.  The current implementation puts all
> >> nodes with CPU into the top tier, and builds the tier hierarchy
> >> tier-by-tier by establishing the per-node demotion targets based
> >> on the distances between nodes.
> >>
> >> This current memory tier kernel interface needs to be improved for
> >> several important use cases:
> >>
> >> The current tier initialization code always initializes
> >> each memory-only NUMA node into a lower tier.  But a memory-only
> >> NUMA node may have a high performance memory device (e.g. a DRAM
> >> device attached via CXL.mem or a DRAM-backed memory-only node on
> >> a virtual machine) and should be put into a higher tier.
> >>
> >> The current tier hierarchy always puts CPU nodes into the top
> >> tier. But on a system with HBM or GPU devices, the
> >> memory-only NUMA nodes mapping these devices should be in the
> >> top tier, and DRAM nodes with CPUs are better to be placed into the
> >> next lower tier.
> >>
> >> With the current kernel, a higher tier node can only be demoted to selected nodes on the
> >> next lower tier as defined by the demotion path, not any other
> >> node from any lower tier.  This strict, hard-coded demotion order
> >> does not work in all use cases (e.g. some use cases may want to
> >> allow cross-socket demotion to another node in the same demotion
> >> tier as a fallback when the preferred demotion node is out of
> >> space), This demotion order is also inconsistent with the page
> >> allocation fallback order when all the nodes in a higher tier are
> >> out of space: The page allocation can fall back to any node from
> >> any lower tier, whereas the demotion order doesn't allow that.
> >>
> >> The current kernel also doesn't provide any interfaces for the
> >> userspace to learn about the memory tier hierarchy in order to
> >> optimize its memory allocations.
> >>
> >> This patch series addresses the above by defining memory tiers explicitly.
> >>
> >> This patch introduces explicit memory tiers with ranks. The rank
> >> value of a memory tier is used to derive the demotion order between
> >> NUMA nodes. The memory tiers present in a system can be found at
> >>
> >> /sys/devices/system/memtier/memtierN/
> >>
> >> The nodes which are part of a specific memory tier can be listed
> >> via
> >> /sys/devices/system/memtier/memtierN/nodelist
> >>
> >> "Rank" is an opaque value. Its absolute value doesn't have any
> >> special meaning. But the rank values of different memtiers can be
> >> compared with each other to determine the memory tier order.
> >>
> >> For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
> >> their rank values are 300, 100, 200, then the memory tier order is:
> >> memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
> >> and memtier1 is the lowest tier.
> >>
> >> The rank value of each memtier should be unique.
> >>
> >> A higher rank memory tier will appear first in the demotion order
> >> than a lower rank memory tier. ie. while reclaim we choose a node
> >> in higher rank memory tier to demote pages to as compared to a node
> >> in a lower rank memory tier.
> >>
> >> For now we are not adding support for a dynamic number of memory
> >> tiers. But a future series supporting that is possible. Currently the
> >> number of tiers supported is limited to MAX_MEMORY_TIERS (3).
> >> When doing memory hotplug, if a node is not added to a memory tier,
> >> the NUMA node gets added to DEFAULT_MEMORY_TIER (1).
> >>
> >> This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
> >>
> >> [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> >>
> >> Suggested-by: Wei Xu <weixugc@google.com>
> >> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> >> ---
> >>   include/linux/memory-tiers.h |  20 ++++
> >>   mm/Kconfig                   |  11 ++
> >>   mm/Makefile                  |   1 +
> >>   mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
> >>   4 files changed, 220 insertions(+)
> >>   create mode 100644 include/linux/memory-tiers.h
> >>   create mode 100644 mm/memory-tiers.c
> >>
> >> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> >> new file mode 100644
> >> index 000000000000..e17f6b4ee177
> >> --- /dev/null
> >> +++ b/include/linux/memory-tiers.h
> >> @@ -0,0 +1,20 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +#ifndef _LINUX_MEMORY_TIERS_H
> >> +#define _LINUX_MEMORY_TIERS_H
> >> +
> >> +#ifdef CONFIG_TIERED_MEMORY
> >> +
> >> +#define MEMORY_TIER_HBM_GPU    0
> >> +#define MEMORY_TIER_DRAM       1
> >> +#define MEMORY_TIER_PMEM       2
> >> +
> >> +#define MEMORY_RANK_HBM_GPU    300
> >> +#define MEMORY_RANK_DRAM       200
> >> +#define MEMORY_RANK_PMEM       100
> >> +
> >> +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
> >> +#define MAX_MEMORY_TIERS  3
> >> +
> >> +#endif /* CONFIG_TIERED_MEMORY */
> >> +
> >> +#endif
> >> diff --git a/mm/Kconfig b/mm/Kconfig
> >> index 169e64192e48..08a3d330740b 100644
> >> --- a/mm/Kconfig
> >> +++ b/mm/Kconfig
> >> @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> >>   config ARCH_ENABLE_THP_MIGRATION
> >>          bool
> >>
> >> +config TIERED_MEMORY
> >> +       bool "Support for explicit memory tiers"
> >> +       def_bool n
> >> +       depends on MIGRATION && NUMA
> >> +       help
> >> +         Support to split nodes into memory tiers explicitly and
> >> +         to demote pages on reclaim to lower tiers. This option
> >> +         also exposes sysfs interface to read nodes available in
> >> +         specific tier and to move specific node among different
> >> +         possible tiers.
> >
> > IMHO we should not need a new kernel config. If tiering is not present
> > then there is just one tier on the system. And tiering is a kind of
> > hardware configuration; the information could be shown regardless of
> > whether demotion/promotion is supported/enabled or not.
> >
>
> This was added so that we could avoid doing multiple
>
> #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
>
> Initially I had that as def_bool y and depends on MIGRATION && NUMA. But
> it was later suggested that def_bool is not recommended for newer configs.
>
> How about
>
>   config TIERED_MEMORY
>         bool "Support for explicit memory tiers"
> -       def_bool n
> -       depends on MIGRATION && NUMA
> -       help
> -         Support to split nodes into memory tiers explicitly and
> -         to demote pages on reclaim to lower tiers. This option
> -         also exposes sysfs interface to read nodes available in
> -         specific tier and to move specific node among different
> -         possible tiers.
> +       def_bool MIGRATION && NUMA

CONFIG_NUMA should be good enough. Memory tiering doesn't have to mean
that demotion/promotion is supported, IMHO.

>
>   config HUGETLB_PAGE_SIZE_VARIABLE
>         def_bool n
>
> ie, we just make it a Kconfig variable without exposing it to the user?
>
> -aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-08 16:13       ` Aneesh Kumar K V
@ 2022-06-08 18:16         ` Johannes Weiner
  2022-06-09  2:33           ` Aneesh Kumar K V
  0 siblings, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2022-06-08 18:16 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote:
> On 6/8/22 9:25 PM, Johannes Weiner wrote:
> > Hello,
> > 
> > On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:
> > > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
> > > > @@ -0,0 +1,20 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > > +#define _LINUX_MEMORY_TIERS_H
> > > > +
> > > > +#ifdef CONFIG_TIERED_MEMORY
> > > > +
> > > > +#define MEMORY_TIER_HBM_GPU	0
> > > > +#define MEMORY_TIER_DRAM	1
> > > > +#define MEMORY_TIER_PMEM	2
> > > > +
> > > > +#define MEMORY_RANK_HBM_GPU	300
> > > > +#define MEMORY_RANK_DRAM	200
> > > > +#define MEMORY_RANK_PMEM	100
> > > > +
> > > > +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> > > > +#define MAX_MEMORY_TIERS  3
> > > 
> > > I understand the names are somewhat arbitrary, and the tier ID space
> > > can be expanded down the line by bumping MAX_MEMORY_TIERS.
> > > 
> > > But starting out with a packed ID space can get quite awkward for
> > > users when new tiers - especially intermediate tiers - show up in
> > > existing configurations. I mentioned in the other email that DRAM !=
> > > DRAM, so new tiers seem inevitable already.
> > > 
> > > It could make sense to start with a bigger address space and spread
> > > out the list of kernel default tiers a bit within it:
> > > 
> > > MEMORY_TIER_GPU		0
> > > MEMORY_TIER_DRAM	10
> > > MEMORY_TIER_PMEM	20
> > 
> > Forgive me if I'm asking a question that has been answered. I went
> > back to earlier threads and couldn't work it out - maybe there were
> > some off-list discussions? Anyway...
> > 
> > Why is there a distinction between tier ID and rank? I understand that
> > rank was added because tier IDs were too few. But if rank determines
> > ordering, what is the use of a separate tier ID? IOW, why not make the
> > tier ID space wider and have the kernel pick a few spread out defaults
> > based on known hardware, with plenty of headroom to be future proof.
> > 
> >    $ ls tiers
> >    100				# DEFAULT_TIER
> >    $ cat tiers/100/nodelist
> >    0-1				# conventional numa nodes
> > 
> >    <pmem is onlined>
> > 
> >    $ grep . tiers/*/nodelist
> >    tiers/100/nodelist:0-1	# conventional numa
> >    tiers/200/nodelist:2		# pmem
> > 
> >    $ grep . nodes/*/tier
> >    nodes/0/tier:100
> >    nodes/1/tier:100
> >    nodes/2/tier:200
> > 
> >    <unknown device is online as node 3, defaults to 100>
> > 
> >    $ grep . tiers/*/nodelist
> >    tiers/100/nodelist:0-1,3
> >    tiers/200/nodelist:2
> > 
> >    $ echo 300 >nodes/3/tier
> >    $ grep . tiers/*/nodelist
> >    tiers/100/nodelist:0-1
> >    tiers/200/nodelist:2
> >    tiers/300/nodelist:3
> > 
> >    $ echo 200 >nodes/3/tier
> >    $ grep . tiers/*/nodelist
> >    tiers/100/nodelist:0-1	
> >    tiers/200/nodelist:2-3
> > 
> > etc.
> 
> tier ID is also used as the device id, memtier.dev.id. It was discussed that
> we would need the ability to change the rank value of a memory tier. If we
> make the rank value the same as the tier ID or tier device id, we will not
> be able to support that.

Is the idea that you could change the rank of a collection of nodes in
one go? Rather than moving the nodes one by one into a new tier?

[ Sorry, I wasn't able to find this discussion. AFAICS the first
  patches in RFC4 already had the struct device { .id = tier }
  logic. Could you point me to it? In general it would be really
  helpful to maintain summarized rationales for such decisions in the
  coverletter to make sure things don't get lost over many, many
  threads, conferences, and video calls. ]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-06-08 14:37                 ` Aneesh Kumar K.V
@ 2022-06-08 20:14                   ` Tim Chen
  2022-06-10  6:04                   ` Ying Huang
  1 sibling, 0 replies; 84+ messages in thread
From: Tim Chen @ 2022-06-08 20:14 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Ying Huang, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Wed, 2022-06-08 at 20:07 +0530, Aneesh Kumar K.V wrote:
> 
> 
> This is what I am testing now. We still need to closely audit the
> lock-free access to NODE_DATA()->memtier.

You're referring to this or something else?  This is a write, so it seems okay.

> +	for_each_node_state(node, N_MEMORY) {
> +		/*
> +		 * Should be safe to do this early in the boot.
> +		 */
> +		NODE_DATA(node)->memtier = memtier;
> +		node_set(node, memtier->nodelist);
> +	}
>  	migrate_on_reclaim_init();


> For v6 I will keep this as a
> separate patch and once we all agree that it is safe, I will fold it
> back.

Please update code that uses __node_get_memory_tier(node) to use
NODE_DATA(node)->memtier;

Otherwise the code looks okay at first glance.

Tim

> 
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index a388a806b61a..3e733de1a8a0 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -17,7 +17,6 @@
>  #define MAX_MEMORY_TIERS  (MAX_STATIC_MEMORY_TIERS + 2)
>  
>  extern bool numa_demotion_enabled;
> -extern nodemask_t promotion_mask;
>  int node_create_and_set_memory_tier(int node, int tier);
>  int next_demotion_node(int node);
>  int node_set_memory_tier(int node, int tier);
> @@ -25,15 +24,7 @@ int node_get_memory_tier_id(int node);
>  int node_reset_memory_tier(int node, int tier);
>  void node_remove_from_memory_tier(int node);
>  void node_get_allowed_targets(int node, nodemask_t *targets);
> -
> -/*
> - * By default all nodes are top tiper. As we create new memory tiers
> - * we below top tiers we add them to NON_TOP_TIER state.
> - */
> -static inline bool node_is_toptier(int node)
> -{
> -	return !node_isset(node, promotion_mask);
> -}
> +bool node_is_toptier(int node);
>  
>  #else
>  #define numa_demotion_enabled	false
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index aab70355d64f..c4fcfd2b9980 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -928,6 +928,9 @@ typedef struct pglist_data {
>  	/* Per-node vmstats */
>  	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
>  	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
> +#ifdef CONFIG_TIERED_MEMORY
> +	struct memory_tier *memtier;
> +#endif
>  } pg_data_t;
>  
>  #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 29a038bb38b0..31ef0fab5f19 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -7,6 +7,7 @@
>  #include <linux/random.h>
>  #include <linux/memory.h>
>  #include <linux/idr.h>
> +#include <linux/rcupdate.h>
>  
>  #include "internal.h"
>  
> @@ -26,7 +27,7 @@ struct demotion_nodes {
>  static void establish_migration_targets(void);
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
> -nodemask_t promotion_mask;
> +static int top_tier_rank;
>  /*
>   * node_demotion[] examples:
>   *
> @@ -135,7 +136,7 @@ static void memory_tier_device_release(struct device *dev)
>  	if (tier->dev.id >= MAX_STATIC_MEMORY_TIERS)
>  		ida_free(&memtier_dev_id, tier->dev.id);
>  
> -	kfree(tier);
> +	kfree_rcu(tier);
>  }
>  
>  /*
> @@ -233,6 +234,70 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
>  	return NULL;
>  }
>  
> +/*
> + * Called with memory_tier_lock held. Hence the device references cannot
> + * be dropped during this function.
> + */
> +static void memtier_node_clear(int node, struct memory_tier *memtier)
> +{
> +	pg_data_t *pgdat;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
> +
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	/*
> +	 * Make sure the read side sees the NULL value before we clear the
> +	 * node from the nodelist.
> +	 */
> +	synchronize_rcu();
> +	node_clear(node, memtier->nodelist);
> +}
> +
> +static void memtier_node_set(int node, struct memory_tier *memtier)
> +{
> +	pg_data_t *pgdat;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
> +	/*
> +	 * Make sure we mark the memtier NULL before we assign the new memory
> +	 * tier to the NUMA node. This makes sure that anybody looking at
> +	 * NODE_DATA finds either a NULL memtier or one which is still valid.
> +	 */
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	synchronize_rcu();
> +	node_set(node, memtier->nodelist);
> +	rcu_assign_pointer(pgdat->memtier, memtier);
> +}
> +
> +bool node_is_toptier(int node)
> +{
> +	bool toptier;
> +	pg_data_t *pgdat;
> +	struct memory_tier *memtier;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return false;
> +
> +	rcu_read_lock();
> +	memtier = rcu_dereference(pgdat->memtier);
> +	if (!memtier) {
> +		toptier = true;
> +		goto out;
> +	}
> +	if (memtier->rank >= top_tier_rank)
> +		toptier = true;
> +	else
> +		toptier = false;
> +out:
> +	rcu_read_unlock();
> +	return toptier;
> +}
> +
>  static int __node_create_and_set_memory_tier(int node, int tier)
>  {
>  	int ret = 0;
> @@ -253,7 +318,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
>  			goto out;
>  		}
>  	}
> -	node_set(node, memtier->nodelist);
> +	memtier_node_set(node, memtier);
>  out:
>  	return ret;
>  }
> @@ -275,12 +340,12 @@ int node_create_and_set_memory_tier(int node, int tier)
>  	if (current_tier->dev.id == tier)
>  		goto out;
>  
> -	node_clear(node, current_tier->nodelist);
> +	memtier_node_clear(node, current_tier);
>  
>  	ret = __node_create_and_set_memory_tier(node, tier);
>  	if (ret) {
>  		/* reset it back to older tier */
> -		node_set(node, current_tier->nodelist);
> +		memtier_node_set(node, current_tier);
>  		goto out;
>  	}
>  
> @@ -305,7 +370,7 @@ static int __node_set_memory_tier(int node, int tier)
>  		ret = -EINVAL;
>  		goto out;
>  	}
> -	node_set(node, memtier->nodelist);
> +	memtier_node_set(node, memtier);
>  out:
>  	return ret;
>  }
> @@ -374,12 +439,12 @@ int node_reset_memory_tier(int node, int tier)
>  	if (current_tier->dev.id == tier)
>  		goto out;
>  
> -	node_clear(node, current_tier->nodelist);
> +	memtier_node_clear(node, current_tier);
>  
>  	ret = __node_set_memory_tier(node, tier);
>  	if (ret) {
>  		/* reset it back to older tier */
> -		node_set(node, current_tier->nodelist);
> +		memtier_node_set(node, current_tier);
>  		goto out;
>  	}
>  
> @@ -407,7 +472,7 @@ void node_remove_from_memory_tier(int node)
>  	 * empty then unregister it to make it invisible
>  	 * in sysfs.
>  	 */
> -	node_clear(node, memtier->nodelist);
> +	memtier_node_clear(node, memtier);
>  	if (nodes_empty(memtier->nodelist))
>  		unregister_memory_tier(memtier);
>  
> @@ -570,15 +635,13 @@ static void establish_migration_targets(void)
>  	 * a memory tier, we consider that tier as top tiper from
>  	 * which promotion is not allowed.
>  	 */
> -	promotion_mask = NODE_MASK_NONE;
>  	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
>  		nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
> -		if (nodes_empty(allowed))
> -			nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
> -		else
> +		if (!nodes_empty(allowed)) {
> +			top_tier_rank = memtier->rank;
>  			break;
> +		}
>  	}
> -
>  	pr_emerg("top tier rank is %d\n", top_tier_rank);
>  	allowed = NODE_MASK_NONE;
>  	/*
> @@ -748,7 +811,7 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
>  
>  static int __init memory_tier_init(void)
>  {
> -	int ret;
> +	int ret, node;
>  	struct memory_tier *memtier;
>  
>  	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> @@ -766,7 +829,13 @@ static int __init memory_tier_init(void)
>  		panic("%s() failed to register memory tier: %d\n", __func__, ret);
>  
>  	/* CPU only nodes are not part of memory tiers. */
> -	memtier->nodelist = node_states[N_MEMORY];
> +	for_each_node_state(node, N_MEMORY) {
> +		/*
> +		 * Should be safe to do this early in the boot.
> +		 */
> +		NODE_DATA(node)->memtier = memtier;
> +		node_set(node, memtier->nodelist);
> +	}
>  	migrate_on_reclaim_init();
>  
>  	return 0;


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-08 18:16         ` Johannes Weiner
@ 2022-06-09  2:33           ` Aneesh Kumar K V
  2022-06-09 13:55             ` Johannes Weiner
  0 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-09  2:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 11:46 PM, Johannes Weiner wrote:
> On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote:
>> On 6/8/22 9:25 PM, Johannes Weiner wrote:
>>> Hello,
>>>
>>> On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:
>>>> On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
>>>>> @@ -0,0 +1,20 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>> +#ifndef _LINUX_MEMORY_TIERS_H
>>>>> +#define _LINUX_MEMORY_TIERS_H
>>>>> +
>>>>> +#ifdef CONFIG_TIERED_MEMORY
>>>>> +
>>>>> +#define MEMORY_TIER_HBM_GPU	0
>>>>> +#define MEMORY_TIER_DRAM	1
>>>>> +#define MEMORY_TIER_PMEM	2
>>>>> +
>>>>> +#define MEMORY_RANK_HBM_GPU	300
>>>>> +#define MEMORY_RANK_DRAM	200
>>>>> +#define MEMORY_RANK_PMEM	100
>>>>> +
>>>>> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
>>>>> +#define MAX_MEMORY_TIERS  3
>>>>
>>>> I understand the names are somewhat arbitrary, and the tier ID space
>>>> can be expanded down the line by bumping MAX_MEMORY_TIERS.
>>>>
>>>> But starting out with a packed ID space can get quite awkward for
>>>> users when new tiers - especially intermediate tiers - show up in
>>>> existing configurations. I mentioned in the other email that DRAM !=
>>>> DRAM, so new tiers seem inevitable already.
>>>>
>>>> It could make sense to start with a bigger address space and spread
>>>> out the list of kernel default tiers a bit within it:
>>>>
>>>> MEMORY_TIER_GPU		0
>>>> MEMORY_TIER_DRAM	10
>>>> MEMORY_TIER_PMEM	20
>>>
>>> Forgive me if I'm asking a question that has been answered. I went
>>> back to earlier threads and couldn't work it out - maybe there were
>>> some off-list discussions? Anyway...
>>>
>>> Why is there a distinction between tier ID and rank? I understand that
>>> rank was added because tier IDs were too few. But if rank determines
>>> ordering, what is the use of a separate tier ID? IOW, why not make the
>>> tier ID space wider and have the kernel pick a few spread out defaults
>>> based on known hardware, with plenty of headroom to be future proof.
>>>
>>>     $ ls tiers
>>>     100				# DEFAULT_TIER
>>>     $ cat tiers/100/nodelist
>>>     0-1				# conventional numa nodes
>>>
>>>     <pmem is onlined>
>>>
>>>     $ grep . tiers/*/nodelist
>>>     tiers/100/nodelist:0-1	# conventional numa
>>>     tiers/200/nodelist:2		# pmem
>>>
>>>     $ grep . nodes/*/tier
>>>     nodes/0/tier:100
>>>     nodes/1/tier:100
>>>     nodes/2/tier:200
>>>
>>>     <unknown device is online as node 3, defaults to 100>
>>>
>>>     $ grep . tiers/*/nodelist
>>>     tiers/100/nodelist:0-1,3
>>>     tiers/200/nodelist:2
>>>
>>>     $ echo 300 >nodes/3/tier
>>>     $ grep . tiers/*/nodelist
>>>     tiers/100/nodelist:0-1
>>>     tiers/200/nodelist:2
>>>     tiers/300/nodelist:3
>>>
>>>     $ echo 200 >nodes/3/tier
>>>     $ grep . tiers/*/nodelist
>>>     tiers/100/nodelist:0-1	
>>>     tiers/200/nodelist:2-3
>>>
>>> etc.
>>
>> tier ID is also used as the device id, memtier.dev.id. It was discussed that
>> we would need the ability to change the rank value of a memory tier. If we
>> make the rank value the same as the tier ID or tier device id, we will not
>> be able to support that.
> 
> Is the idea that you could change the rank of a collection of nodes in
> one go? Rather than moving the nodes one by one into a new tier?
> 
> [ Sorry, I wasn't able to find this discussion. AFAICS the first
>    patches in RFC4 already had the struct device { .id = tier }
>    logic. Could you point me to it? In general it would be really
>    helpful to maintain summarized rationales for such decisions in the
>    coverletter to make sure things don't get lost over many, many
>    threads, conferences, and video calls. ]

Most of the discussion happened outside the patch review email threads:

RFC: Memory Tiering Kernel Interfaces (v2)
https://lore.kernel.org/linux-mm/CAAPL-u_diGYEb7+WsgqNBLRix-nRCk2SsDj6p9r8j5JZwOABZQ@mail.gmail.com

RFC: Memory Tiering Kernel Interfaces (v4)
https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-08 16:37       ` Yang Shi
@ 2022-06-09  6:52         ` Ying Huang
  0 siblings, 0 replies; 84+ messages in thread
From: Ying Huang @ 2022-06-09  6:52 UTC (permalink / raw)
  To: Yang Shi
  Cc: Aneesh Kumar K.V, Linux MM, Andrew Morton, Wei Xu, Greg Thelen,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Wed, 2022-06-08 at 09:37 -0700, Yang Shi wrote:
> On Tue, Jun 7, 2022 at 6:34 PM Ying Huang <ying.huang@intel.com> wrote:
> > 
> > On Tue, 2022-06-07 at 14:32 -0700, Yang Shi wrote:
> > > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
> > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > 
> > > > In the current kernel, memory tiers are defined implicitly via a
> > > > demotion path relationship between NUMA nodes, which is created
> > > > during the kernel initialization and updated when a NUMA node is
> > > > hot-added or hot-removed.  The current implementation puts all
> > > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > > tier-by-tier by establishing the per-node demotion targets based
> > > > on the distances between nodes.
> > > > 
> > > > This current memory tier kernel interface needs to be improved for
> > > > several important use cases:
> > > > 
> > > > The current tier initialization code always initializes
> > > > each memory-only NUMA node into a lower tier.  But a memory-only
> > > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > a virtual machine) and should be put into a higher tier.
> > > > 
> > > > The current tier hierarchy always puts CPU nodes into the top
> > > > tier. But on a system with HBM or GPU devices, the
> > > > memory-only NUMA nodes mapping these devices should be in the
> > > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > > next lower tier.
> > > > 
> > > > With the current kernel, a higher tier node can only be demoted to selected nodes on the
> > > > next lower tier as defined by the demotion path, not any other
> > > > node from any lower tier.  This strict, hard-coded demotion order
> > > > does not work in all use cases (e.g. some use cases may want to
> > > > allow cross-socket demotion to another node in the same demotion
> > > > tier as a fallback when the preferred demotion node is out of
> > > > space). This demotion order is also inconsistent with the page
> > > > allocation fallback order when all the nodes in a higher tier are
> > > > out of space: The page allocation can fall back to any node from
> > > > any lower tier, whereas the demotion order doesn't allow that.
> > > > 
> > > > The current kernel also doesn't provide any interfaces for the
> > > > userspace to learn about the memory tier hierarchy in order to
> > > > optimize its memory allocations.
> > > > 
> > > > This patch series addresses the above by defining memory tiers explicitly.
> > > > 
> > > > This patch introduces explicit memory tiers with ranks. The rank
> > > > value of a memory tier is used to derive the demotion order between
> > > > NUMA nodes. The memory tiers present in a system can be found at
> > > > 
> > > > /sys/devices/system/memtier/memtierN/
> > > > 
> > > > The nodes which are part of a specific memory tier can be listed
> > > > via
> > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > 
> > > > "Rank" is an opaque value. Its absolute value doesn't have any
> > > > special meaning. But the rank values of different memtiers can be
> > > > compared with each other to determine the memory tier order.
> > > > 
> > > > For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
> > > > their rank values are 300, 200, 100, then the memory tier order is:
> > > > memtier0 -> memtier1 -> memtier2, where memtier0 is the highest tier
> > > > and memtier2 is the lowest tier.
> > > > 
> > > > The rank value of each memtier should be unique.
> > > > 
> > > > A higher rank memory tier appears earlier in the demotion order
> > > > than a lower rank memory tier, i.e. during reclaim we prefer to
> > > > demote pages to a node in a higher rank memory tier over a node
> > > > in a lower rank memory tier.
> > > > 
> > > > For now we are not adding support for a dynamic number of memory
> > > > tiers, but a future series supporting that is possible. Currently
> > > > the number of tiers supported is limited to MAX_MEMORY_TIERS (3).
> > > > When doing memory hotplug, if a NUMA node is not explicitly assigned
> > > > to a memory tier, it gets added to DEFAULT_MEMORY_TIER (1).
> > > > 
> > > > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
> > > > 
> > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> > > > 
> > > > Suggested-by: Wei Xu <weixugc@google.com>
> > > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > > ---
> > > >  include/linux/memory-tiers.h |  20 ++++
> > > >  mm/Kconfig                   |  11 ++
> > > >  mm/Makefile                  |   1 +
> > > >  mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
> > > >  4 files changed, 220 insertions(+)
> > > >  create mode 100644 include/linux/memory-tiers.h
> > > >  create mode 100644 mm/memory-tiers.c
> > > > 
> > > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > > > new file mode 100644
> > > > index 000000000000..e17f6b4ee177
> > > > --- /dev/null
> > > > +++ b/include/linux/memory-tiers.h
> > > > @@ -0,0 +1,20 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > > +#define _LINUX_MEMORY_TIERS_H
> > > > +
> > > > +#ifdef CONFIG_TIERED_MEMORY
> > > > +
> > > > +#define MEMORY_TIER_HBM_GPU    0
> > > > +#define MEMORY_TIER_DRAM       1
> > > > +#define MEMORY_TIER_PMEM       2
> > > > +
> > > > +#define MEMORY_RANK_HBM_GPU    300
> > > > +#define MEMORY_RANK_DRAM       200
> > > > +#define MEMORY_RANK_PMEM       100
> > > > +
> > > > +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
> > > > +#define MAX_MEMORY_TIERS  3
> > > > +
> > > > +#endif /* CONFIG_TIERED_MEMORY */
> > > > +
> > > > +#endif
> > > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > > index 169e64192e48..08a3d330740b 100644
> > > > --- a/mm/Kconfig
> > > > +++ b/mm/Kconfig
> > > > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> > > >  config ARCH_ENABLE_THP_MIGRATION
> > > >         bool
> > > > 
> > > > +config TIERED_MEMORY
> > > > +       bool "Support for explicit memory tiers"
> > > > +       def_bool n
> > > > +       depends on MIGRATION && NUMA
> > > > +       help
> > > > +         Support to split nodes into memory tiers explicitly and
> > > > +         to demote pages on reclaim to lower tiers. This option
> > > > +         also exposes sysfs interface to read nodes available in
> > > > +         specific tier and to move specific node among different
> > > > +         possible tiers.
> > > 
> > > IMHO we should not need a new kernel config. If tiering is not present
> > > then there is just one tier on the system. And tiering is a kind of
> > > hardware configuration, the information could be shown regardless of
> > > whether demotion/promotion is supported/enabled or not.
> > 
> > I think so too.  At least it appears unnecessary to let the user turn
> > it on/off at configuration time.
> > 
> > All the code should be enclosed by #if defined(CONFIG_NUMA) &&
> > defined(CONFIG_MIGRATION).  So we will not waste memory on small
> > systems.
> 
> CONFIG_NUMA alone should be good enough. CONFIG_MIGRATION is enabled
> by default if NUMA is enabled. And MIGRATION is just used to support
> demotion/promotion. Memory tiers exist even though demotion/promotion
> is not supported, right?

Yes.  You are right.  For example, in the following patch, memory tiers
are used for allocation interleaving.

https://lore.kernel.org/lkml/20220607171949.85796-1-hannes@cmpxchg.org/

Best Regards,
Huang, Ying

> > 
> > > > +
> > > >  config HUGETLB_PAGE_SIZE_VARIABLE
> > > >         def_bool n
> > > >         help
> > > > diff --git a/mm/Makefile b/mm/Makefile
> > > > index 6f9ffa968a1a..482557fbc9d1 100644
> > > > --- a/mm/Makefile
> > > > +++ b/mm/Makefile
> > > > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
> > > >  obj-$(CONFIG_FAILSLAB) += failslab.o
> > > >  obj-$(CONFIG_MEMTEST)          += memtest.o
> > > >  obj-$(CONFIG_MIGRATION) += migrate.o
> > > > +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
> > > >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> > > >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > > >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > > > new file mode 100644
> > > > index 000000000000..7de18d94a08d
> > > > --- /dev/null
> > > > +++ b/mm/memory-tiers.c
> > > > @@ -0,0 +1,188 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +#include <linux/types.h>
> > > > +#include <linux/device.h>
> > > > +#include <linux/nodemask.h>
> > > > +#include <linux/slab.h>
> > > > +#include <linux/memory-tiers.h>
> > > > +
> > > > +struct memory_tier {
> > > > +       struct list_head list;
> > > > +       struct device dev;
> > > > +       nodemask_t nodelist;
> > > > +       int rank;
> > > > +};
> > > > +
> > > > +#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
> > > > +
> > > > +static struct bus_type memory_tier_subsys = {
> > > > +       .name = "memtier",
> > > > +       .dev_name = "memtier",
> > > > +};
> > > > +
> > > > +static DEFINE_MUTEX(memory_tier_lock);
> > > > +static LIST_HEAD(memory_tiers);
> > > > +
> > > > +
> > > > +static ssize_t nodelist_show(struct device *dev,
> > > > +                            struct device_attribute *attr, char *buf)
> > > > +{
> > > > +       struct memory_tier *memtier = to_memory_tier(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%*pbl\n",
> > > > +                         nodemask_pr_args(&memtier->nodelist));
> > > > +}
> > > > +static DEVICE_ATTR_RO(nodelist);
> > > > +
> > > > +static ssize_t rank_show(struct device *dev,
> > > > +                        struct device_attribute *attr, char *buf)
> > > > +{
> > > > +       struct memory_tier *memtier = to_memory_tier(dev);
> > > > +
> > > > +       return sysfs_emit(buf, "%d\n", memtier->rank);
> > > > +}
> > > > +static DEVICE_ATTR_RO(rank);
> > > > +
> > > > +static struct attribute *memory_tier_dev_attrs[] = {
> > > > +       &dev_attr_nodelist.attr,
> > > > +       &dev_attr_rank.attr,
> > > > +       NULL
> > > > +};
> > > > +
> > > > +static const struct attribute_group memory_tier_dev_group = {
> > > > +       .attrs = memory_tier_dev_attrs,
> > > > +};
> > > > +
> > > > +static const struct attribute_group *memory_tier_dev_groups[] = {
> > > > +       &memory_tier_dev_group,
> > > > +       NULL
> > > > +};
> > > > +
> > > > +static void memory_tier_device_release(struct device *dev)
> > > > +{
> > > > +       struct memory_tier *tier = to_memory_tier(dev);
> > > > +
> > > > +       kfree(tier);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Keep it simple by having a direct mapping between
> > > > + * tier index and rank value.
> > > > + */
> > > > +static inline int get_rank_from_tier(unsigned int tier)
> > > > +{
> > > > +       switch (tier) {
> > > > +       case MEMORY_TIER_HBM_GPU:
> > > > +               return MEMORY_RANK_HBM_GPU;
> > > > +       case MEMORY_TIER_DRAM:
> > > > +               return MEMORY_RANK_DRAM;
> > > > +       case MEMORY_TIER_PMEM:
> > > > +               return MEMORY_RANK_PMEM;
> > > > +       }
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +static void insert_memory_tier(struct memory_tier *memtier)
> > > > +{
> > > > +       struct list_head *ent;
> > > > +       struct memory_tier *tmp_memtier;
> > > > +
> > > > +       list_for_each(ent, &memory_tiers) {
> > > > +               tmp_memtier = list_entry(ent, struct memory_tier, list);
> > > > +               if (tmp_memtier->rank < memtier->rank) {
> > > > +                       list_add_tail(&memtier->list, ent);
> > > > +                       return;
> > > > +               }
> > > > +       }
> > > > +       list_add_tail(&memtier->list, &memory_tiers);
> > > > +}
> > > > +
> > > > +static struct memory_tier *register_memory_tier(unsigned int tier)
> > > > +{
> > > > +       int error;
> > > > +       struct memory_tier *memtier;
> > > > +
> > > > +       if (tier >= MAX_MEMORY_TIERS)
> > > > +               return NULL;
> > > > +
> > > > +       memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> > > > +       if (!memtier)
> > > > +               return NULL;
> > > > +
> > > > +       memtier->dev.id = tier;
> > > > +       memtier->rank = get_rank_from_tier(tier);
> > > > +       memtier->dev.bus = &memory_tier_subsys;
> > > > +       memtier->dev.release = memory_tier_device_release;
> > > > +       memtier->dev.groups = memory_tier_dev_groups;
> > > > +
> > > > +       insert_memory_tier(memtier);
> > > > +
> > > > +       error = device_register(&memtier->dev);
> > > > +       if (error) {
> > > > +               list_del(&memtier->list);
> > > > +               put_device(&memtier->dev);
> > > > +               return NULL;
> > > > +       }
> > > > +       return memtier;
> > > > +}
> > > > +
> > > > +__maybe_unused // temporary to prevent warnings during bisects
> > > > +static void unregister_memory_tier(struct memory_tier *memtier)
> > > > +{
> > > > +       list_del(&memtier->list);
> > > > +       device_unregister(&memtier->dev);
> > > > +}
> > > > +
> > > > +static ssize_t
> > > > +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> > > > +{
> > > > +       return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
> > > > +}
> > > > +static DEVICE_ATTR_RO(max_tier);
> > > > +
> > > > +static ssize_t
> > > > +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> > > > +{
> > > > +       return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER);
> > > > +}
> > > > +static DEVICE_ATTR_RO(default_tier);
> > > > +
> > > > +static struct attribute *memory_tier_attrs[] = {
> > > > +       &dev_attr_max_tier.attr,
> > > > +       &dev_attr_default_tier.attr,
> > > > +       NULL
> > > > +};
> > > > +
> > > > +static const struct attribute_group memory_tier_attr_group = {
> > > > +       .attrs = memory_tier_attrs,
> > > > +};
> > > > +
> > > > +static const struct attribute_group *memory_tier_attr_groups[] = {
> > > > +       &memory_tier_attr_group,
> > > > +       NULL,
> > > > +};
> > > > +
> > > > +static int __init memory_tier_init(void)
> > > > +{
> > > > +       int ret;
> > > > +       struct memory_tier *memtier;
> > > > +
> > > > +       ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> > > > +       if (ret)
> > > > +               panic("%s() failed to register subsystem: %d\n", __func__, ret);
> > > > +
> > > > +       /*
> > > > +        * Register only the default memory tier to hide all empty
> > > > +        * memory tiers from sysfs.
> > > > +        */
> > > > +       memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
> > > > +       if (!memtier)
> > > > +               panic("%s() failed to register memory tier: %d\n", __func__, ret);
> > > > +
> > > > +       /* CPU only nodes are not part of memory tiers. */
> > > > +       memtier->nodelist = node_states[N_MEMORY];
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +subsys_initcall(memory_tier_init);
> > > > +
> > > > --
> > > > 2.36.1
> > > > 
> > 
> > 



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-08 16:42       ` Yang Shi
@ 2022-06-09  8:17         ` Aneesh Kumar K V
  2022-06-09 16:04           ` Yang Shi
  0 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-09  8:17 UTC (permalink / raw)
  To: Yang Shi
  Cc: Linux MM, Andrew Morton, Wei Xu, Huang Ying, Greg Thelen,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 10:12 PM, Yang Shi wrote:
> On Tue, Jun 7, 2022 at 9:58 PM Aneesh Kumar K V
> <aneesh.kumar@linux.ibm.com> wrote:

....

>>    config TIERED_MEMORY
>>          bool "Support for explicit memory tiers"
>> -       def_bool n
>> -       depends on MIGRATION && NUMA
>> -       help
>> -         Support to split nodes into memory tiers explicitly and
>> -         to demote pages on reclaim to lower tiers. This option
>> -         also exposes sysfs interface to read nodes available in
>> -         specific tier and to move specific node among different
>> -         possible tiers.
>> +       def_bool MIGRATION && NUMA
> 
> CONFIG_NUMA should be good enough. Memory tiering doesn't have to mean
> demotion/promotion has to be supported IMHO.
> 
>>
>>    config HUGETLB_PAGE_SIZE_VARIABLE
>>          def_bool n
>>
>> ie, we just make it a Kconfig variable without exposing it to the user?
>>

We can do that, but it would also mean that in order to avoid building
the demotion targets etc. we would now need multiple #ifdef
CONFIG_MIGRATION blocks in mm/memory-tiers.c. It builds without those
#ifdefs, so these are not really build errors; rather, we would be
building all the demotion targets with no real use for them.

What use case do you have for exposing memory tiers on a system with
CONFIG_MIGRATION disabled? CONFIG_MIGRATION gets enabled in almost all
configs these days because it is selected by COMPACTION and
TRANSPARENT_HUGEPAGE.

Unless there is a real need, I am wondering if we can avoid sprinkling
#ifdef CONFIG_MIGRATION throughout mm/memory-tiers.c.

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 0/9] mm/demotion: Memory tiers and demotion
  2022-06-08 14:20   ` Aneesh Kumar K V
@ 2022-06-09  8:53     ` Jonathan Cameron
  0 siblings, 0 replies; 84+ messages in thread
From: Jonathan Cameron @ 2022-06-09  8:53 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Johannes Weiner, linux-mm, akpm, Wei Xu, Huang Ying, Greg Thelen,
	Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Wed, 8 Jun 2022 19:50:11 +0530
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:

> On 6/8/22 7:27 PM, Johannes Weiner wrote:
> > Hi Aneesh,
> > 
> > On Fri, Jun 03, 2022 at 07:12:28PM +0530, Aneesh Kumar K.V wrote:  
> >> * The current tier initialization code always initializes
> >>    each memory-only NUMA node into a lower tier.  But a memory-only
> >>    NUMA node may have a high performance memory device (e.g. a DRAM
> >>    device attached via CXL.mem or a DRAM-backed memory-only node on
> >>    a virtual machine) and should be put into a higher tier.  
> > 
> > I have to disagree with this premise. The CXL.mem bus has different
> > latency and bandwidth characteristics. It's also conceivable that
> > cheaper and slower DRAM is connected to the CXL bus (think recycling
> > DDR4 DIMMS after switching to DDR5). DRAM != DRAM.
> > 
> > Our experiments with production workloads show regressions between
> > 15-30% in serviced requests when you don't distinguish toptier DRAM
> > from lower tier DRAM. While it's fixable with manual tuning, your
> > patches would reintroduce this regression, it seems.
> > 
> > Making tiers explicit is a good idea, but can we keep the current
> > default that CPU-less nodes are of a lower tier than ones with CPU?
> > I'm having a hard time imagining where this wouldn't be true... Or why
> > it shouldn't be those esoteric cases that need the manual tuning.  
> 
> This was mostly driven by virtual machine configs, where we can find
> memory-only NUMA nodes depending on the resource availability in the
> hypervisor.
> 
> Will these CXL devices be initialized by a driver? 

In many cases no (almost all cases pre CXL 2.0) - they are set up by the
BIOS / firmware and presented just like normal memory (except pmem, in which
case kmem will be relevant as you suggest).  Hopefully everyone will follow
the guidance and provide appropriate HMAT as well as SLIT.

If we want to do a better job of the default policy, we should take
the actual distances into account (ideally including the detail
HMAT provides).
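
As a sketch of what such a distance-based default could look like: the
helper name and the cutoff value of 30 below are invented for
illustration, while node_distance(), LOCAL_DISTANCE and the
MEMORY_TIER_* constants come from the kernel and the posted patch:

/*
 * Sketch only: pick a default tier from SLIT distances.
 */
static int default_tier_for_node(int node)
{
	int cpu_node, best = INT_MAX;

	/* Distance to the nearest CPU-bearing node. */
	for_each_node_state(cpu_node, N_CPU)
		best = min(best, node_distance(node, cpu_node));

	/* Anything close to LOCAL_DISTANCE (10) is treated as DRAM-class. */
	if (best < 30)
		return MEMORY_TIER_DRAM;

	/* Much further away: assume a slower device and place it lower. */
	return MEMORY_TIER_PMEM;
}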

Jonathan

> For example, if they
> are going to be initialized via dax kmem, we already consider them a
> lower memory tier with this patch series.
> 
> -aneesh


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-09  2:33           ` Aneesh Kumar K V
@ 2022-06-09 13:55             ` Johannes Weiner
  2022-06-09 14:22               ` Jonathan Cameron
  0 siblings, 1 reply; 84+ messages in thread
From: Johannes Weiner @ 2022-06-09 13:55 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Thu, Jun 09, 2022 at 08:03:26AM +0530, Aneesh Kumar K V wrote:
> On 6/8/22 11:46 PM, Johannes Weiner wrote:
> > On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote:
> > > On 6/8/22 9:25 PM, Johannes Weiner wrote:
> > > > Hello,
> > > > 
> > > > On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:
> > > > > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
> > > > > > @@ -0,0 +1,20 @@
> > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > > > > +#define _LINUX_MEMORY_TIERS_H
> > > > > > +
> > > > > > +#ifdef CONFIG_TIERED_MEMORY
> > > > > > +
> > > > > > +#define MEMORY_TIER_HBM_GPU	0
> > > > > > +#define MEMORY_TIER_DRAM	1
> > > > > > +#define MEMORY_TIER_PMEM	2
> > > > > > +
> > > > > > +#define MEMORY_RANK_HBM_GPU	300
> > > > > > +#define MEMORY_RANK_DRAM	200
> > > > > > +#define MEMORY_RANK_PMEM	100
> > > > > > +
> > > > > > +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> > > > > > +#define MAX_MEMORY_TIERS  3
> > > > > 
> > > > > I understand the names are somewhat arbitrary, and the tier ID space
> > > > > can be expanded down the line by bumping MAX_MEMORY_TIERS.
> > > > > 
> > > > > But starting out with a packed ID space can get quite awkward for
> > > > > users when new tiers - especially intermediate tiers - show up in
> > > > > existing configurations. I mentioned in the other email that DRAM !=
> > > > > DRAM, so new tiers seem inevitable already.
> > > > > 
> > > > > It could make sense to start with a bigger address space and spread
> > > > > out the list of kernel default tiers a bit within it:
> > > > > 
> > > > > MEMORY_TIER_GPU		0
> > > > > MEMORY_TIER_DRAM	10
> > > > > MEMORY_TIER_PMEM	20
> > > > 
> > > > Forgive me if I'm asking a question that has been answered. I went
> > > > back to earlier threads and couldn't work it out - maybe there were
> > > > some off-list discussions? Anyway...
> > > > 
> > > > Why is there a distinction between tier ID and rank? I understand that
> > > > rank was added because tier IDs were too few. But if rank determines
> > > > ordering, what is the use of a separate tier ID? IOW, why not make the
> > > > tier ID space wider and have the kernel pick a few spread out defaults
> > > > based on known hardware, with plenty of headroom to be future proof.
> > > > 
> > > >     $ ls tiers
> > > >     100				# DEFAULT_TIER
> > > >     $ cat tiers/100/nodelist
> > > >     0-1				# conventional numa nodes
> > > > 
> > > >     <pmem is onlined>
> > > > 
> > > >     $ grep . tiers/*/nodelist
> > > >     tiers/100/nodelist:0-1	# conventional numa
> > > >     tiers/200/nodelist:2		# pmem
> > > > 
> > > >     $ grep . nodes/*/tier
> > > >     nodes/0/tier:100
> > > >     nodes/1/tier:100
> > > >     nodes/2/tier:200
> > > > 
> > > >     <unknown device is online as node 3, defaults to 100>
> > > > 
> > > >     $ grep . tiers/*/nodelist
> > > >     tiers/100/nodelist:0-1,3
> > > >     tiers/200/nodelist:2
> > > > 
> > > >     $ echo 300 >nodes/3/tier
> > > >     $ grep . tiers/*/nodelist
> > > >     tiers/100/nodelist:0-1
> > > >     tiers/200/nodelist:2
> > > >     tiers/300/nodelist:3
> > > > 
> > > >     $ echo 200 >nodes/3/tier
> > > >     $ grep . tiers/*/nodelist
> > > >     tiers/100/nodelist:0-1	
> > > >     tiers/200/nodelist:2-3
> > > > 
> > > > etc.
> > > 
> > > tier ID is also used as the device id, memtier.dev.id. It was discussed
> > > that we would need the ability to change the rank value of a memory tier.
> > > If we make the rank value the same as the tier ID or tier device id, we
> > > will not be able to support that.
> > 
> > Is the idea that you could change the rank of a collection of nodes in
> > one go? Rather than moving the nodes one by one into a new tier?
> > 
> > [ Sorry, I wasn't able to find this discussion. AFAICS the first
> >    patches in RFC4 already had the struct device { .id = tier }
> >    logic. Could you point me to it? In general it would be really
> >    helpful to maintain summarized rationales for such decisions in the
> >    coverletter to make sure things don't get lost over many, many
> >    threads, conferences, and video calls. ]
> 
> Most of the discussion happened outside the patch review email threads.
> 
> RFC: Memory Tiering Kernel Interfaces (v2)
> https://lore.kernel.org/linux-mm/CAAPL-u_diGYEb7+WsgqNBLRix-nRCk2SsDj6p9r8j5JZwOABZQ@mail.gmail.com
> 
> RFC: Memory Tiering Kernel Interfaces (v4)
> https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com

I read the RFCs, the discussions and your code. It's still not clear
why the tier/device ID and the rank need to be two separate,
user-visible things. There is only one tier of a given rank, why can't
the rank be the unique device id? dev->id = 100. One number. Or use a
unique device id allocator if large numbers are causing problems
internally. But I don't see an explanation why they need to be two
different things, let alone two different things in the user ABI.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-09 13:55             ` Johannes Weiner
@ 2022-06-09 14:22               ` Jonathan Cameron
  2022-06-09 20:41                 ` Johannes Weiner
  0 siblings, 1 reply; 84+ messages in thread
From: Jonathan Cameron @ 2022-06-09 14:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Aneesh Kumar K V, linux-mm, akpm, Wei Xu, Huang Ying,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Thu, 9 Jun 2022 09:55:45 -0400
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Thu, Jun 09, 2022 at 08:03:26AM +0530, Aneesh Kumar K V wrote:
> > On 6/8/22 11:46 PM, Johannes Weiner wrote:  
> > > On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote:  
> > > > On 6/8/22 9:25 PM, Johannes Weiner wrote:  
> > > > > Hello,
> > > > > 
> > > > > On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:  
> > > > > > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:  
> > > > > > > @@ -0,0 +1,20 @@
> > > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > > > > > +#define _LINUX_MEMORY_TIERS_H
> > > > > > > +
> > > > > > > +#ifdef CONFIG_TIERED_MEMORY
> > > > > > > +
> > > > > > > +#define MEMORY_TIER_HBM_GPU	0
> > > > > > > +#define MEMORY_TIER_DRAM	1
> > > > > > > +#define MEMORY_TIER_PMEM	2
> > > > > > > +
> > > > > > > +#define MEMORY_RANK_HBM_GPU	300
> > > > > > > +#define MEMORY_RANK_DRAM	200
> > > > > > > +#define MEMORY_RANK_PMEM	100
> > > > > > > +
> > > > > > > +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> > > > > > > +#define MAX_MEMORY_TIERS  3  
> > > > > > 
> > > > > > I understand the names are somewhat arbitrary, and the tier ID space
> > > > > > can be expanded down the line by bumping MAX_MEMORY_TIERS.
> > > > > > 
> > > > > > But starting out with a packed ID space can get quite awkward for
> > > > > > users when new tiers - especially intermediate tiers - show up in
> > > > > > existing configurations. I mentioned in the other email that DRAM !=
> > > > > > DRAM, so new tiers seem inevitable already.
> > > > > > 
> > > > > > It could make sense to start with a bigger address space and spread
> > > > > > out the list of kernel default tiers a bit within it:
> > > > > > 
> > > > > > MEMORY_TIER_GPU		0
> > > > > > MEMORY_TIER_DRAM	10
> > > > > > MEMORY_TIER_PMEM	20  
> > > > > 
> > > > > Forgive me if I'm asking a question that has been answered. I went
> > > > > back to earlier threads and couldn't work it out - maybe there were
> > > > > some off-list discussions? Anyway...
> > > > > 
> > > > > Why is there a distinction between tier ID and rank? I understand that
> > > > > rank was added because tier IDs were too few. But if rank determines
> > > > > ordering, what is the use of a separate tier ID? IOW, why not make the
> > > > > tier ID space wider and have the kernel pick a few spread out defaults
> > > > > based on known hardware, with plenty of headroom to be future proof.
> > > > > 
> > > > >     $ ls tiers
> > > > >     100				# DEFAULT_TIER
> > > > >     $ cat tiers/100/nodelist
> > > > >     0-1				# conventional numa nodes
> > > > > 
> > > > >     <pmem is onlined>
> > > > > 
> > > > >     $ grep . tiers/*/nodelist
> > > > >     tiers/100/nodelist:0-1	# conventional numa
> > > > >     tiers/200/nodelist:2		# pmem
> > > > > 
> > > > >     $ grep . nodes/*/tier
> > > > >     nodes/0/tier:100
> > > > >     nodes/1/tier:100
> > > > >     nodes/2/tier:200
> > > > > 
> > > > >     <unknown device is online as node 3, defaults to 100>
> > > > > 
> > > > >     $ grep . tiers/*/nodelist
> > > > >     tiers/100/nodelist:0-1,3
> > > > >     tiers/200/nodelist:2
> > > > > 
> > > > >     $ echo 300 >nodes/3/tier
> > > > >     $ grep . tiers/*/nodelist
> > > > >     tiers/100/nodelist:0-1
> > > > >     tiers/200/nodelist:2
> > > > >     tiers/300/nodelist:3
> > > > > 
> > > > >     $ echo 200 >nodes/3/tier
> > > > >     $ grep . tiers/*/nodelist
> > > > >     tiers/100/nodelist:0-1	
> > > > >     tiers/200/nodelist:2-3
> > > > > 
> > > > > etc.  
> > > > 
> > > > tier ID is also used as device id memtier.dev.id. It was discussed that we
> > > > would need the ability to change the rank value of a memory tier. If we make
> > > > rank value same as tier ID or tier device id, we will not be able to support
> > > > that.  
> > > 
> > > Is the idea that you could change the rank of a collection of nodes in
> > > one go? Rather than moving the nodes one by one into a new tier?
> > > 
> > > [ Sorry, I wasn't able to find this discussion. AFAICS the first
> > >    patches in RFC4 already had the struct device { .id = tier }
> > >    logic. Could you point me to it? In general it would be really
> > >    helpful to maintain summarized rationales for such decisions in the
> > >    coverletter to make sure things don't get lost over many, many
> > >    threads, conferences, and video calls. ]  
> > 
> > Most of the discussion happened outside the patch review email threads.
> > 
> > RFC: Memory Tiering Kernel Interfaces (v2)
> > https://lore.kernel.org/linux-mm/CAAPL-u_diGYEb7+WsgqNBLRix-nRCk2SsDj6p9r8j5JZwOABZQ@mail.gmail.com
> > 
> > RFC: Memory Tiering Kernel Interfaces (v4)
> > https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com  
> 
> I read the RFCs, the discussions and your code. It's still not clear
> why the tier/device ID and the rank need to be two separate,
> user-visible things. There is only one tier of a given rank, why can't
> the rank be the unique device id? dev->id = 100. One number. Or use a
> unique device id allocator if large numbers are causing problems
> internally. But I don't see an explanation why they need to be two
> different things, let alone two different things in the user ABI.

I think discussion hinged on it making sense to be able to change
rank of a tier rather than create a new tier and move things one by one.
Example was wanting to change the rank of a tier that was created
either by core code or a subsystem.

E.g. if the GPU driver creates a tier, the assumption is that all similar
GPUs will default to the same tier (if hot-plugged later, for example), as
the driver subsystem will keep a reference to the created tier.
Hence if the user wants to change the order of that tier relative to
other tiers, the option of creating a new tier and moving the
devices would then require infrastructure to tell the GPU
driver to now use the new tier for additional devices.
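
As a sketch of what that id/rank split buys (memtier_set_rank() is an
invented name, not something in the posted series): re-ranking a whole
tier could reuse the structures from patch 1 directly, while
memtier->dev.id - and with it the memtierN sysfs name and any subsystem
references - stays stable:

/*
 * Hypothetical helper, not part of the posted series: change the
 * rank of an existing tier in one go. Only the position in the
 * rank-sorted tier list changes. A real version would also need
 * to reject duplicate rank values.
 */
static int memtier_set_rank(struct memory_tier *memtier, int rank)
{
	mutex_lock(&memory_tier_lock);
	list_del(&memtier->list);
	memtier->rank = rank;
	insert_memory_tier(memtier);	/* re-insert sorted by new rank */
	mutex_unlock(&memory_tier_lock);
	return 0;
}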

Or we could go with new nodes not being assigned to any tier, with
userspace always responsible for that assignment.  That may be a problem
for anything relying on existing behavior.  It means that there must
always be a sensible userspace script...

Jonathan





^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-09  8:17         ` Aneesh Kumar K V
@ 2022-06-09 16:04           ` Yang Shi
  0 siblings, 0 replies; 84+ messages in thread
From: Yang Shi @ 2022-06-09 16:04 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Linux MM, Andrew Morton, Wei Xu, Huang Ying, Greg Thelen,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Thu, Jun 9, 2022 at 1:18 AM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 6/8/22 10:12 PM, Yang Shi wrote:
> > On Tue, Jun 7, 2022 at 9:58 PM Aneesh Kumar K V
> > <aneesh.kumar@linux.ibm.com> wrote:
>
> ....
>
> >>    config TIERED_MEMORY
> >>          bool "Support for explicit memory tiers"
> >> -       def_bool n
> >> -       depends on MIGRATION && NUMA
> >> -       help
> >> -         Support to split nodes into memory tiers explicitly and
> >> -         to demote pages on reclaim to lower tiers. This option
> >> -         also exposes sysfs interface to read nodes available in
> >> -         specific tier and to move specific node among different
> >> -         possible tiers.
> >> +       def_bool MIGRATION && NUMA
> >
> > CONFIG_NUMA should be good enough. Memory tiering doesn't have to mean
> > demotion/promotion has to be supported IMHO.
> >
> >>
> >>    config HUGETLB_PAGE_SIZE_VARIABLE
> >>          def_bool n
> >>
> >> ie, we just make it a Kconfig variable without exposing it to the user?
> >>
>
> We can do that but that would also mean in order to avoid building the
> demotion targets etc we will now have to have multiple #ifdef
> CONFIG_MIGRATION in mm/memory-tiers.c . It builds without those #ifdef
> So these are not really build errors, but rather we will be building all
> the demotion targets for no real use with them.

Can we have default demotion targets for !MIGRATION? For example, all
demotion targets are -1.
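
A sketch of that stub approach, modeled on how next_demotion_node() is
already handled for builds without migration support in
include/linux/migrate.h (NUMA_NO_NODE being the kernel's spelling of
the -1 above):

#ifdef CONFIG_MIGRATION
int next_demotion_node(int node);
#else
static inline int next_demotion_node(int node)
{
	/* No migration support: every node's demotion target is "none". */
	return NUMA_NO_NODE;
}
#endif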

>
> What usecase do you have to expose memory tiers on a system with
> CONFIG_MIGRATION disabled? CONFIG_MIGRATION gets enabled in almost all
> configs these days due to its dependency against COMPACTION and
> TRANSPARENT_HUGEPAGE.

Johannes's interleave series is an example,
https://lore.kernel.org/lkml/20220607171949.85796-1-hannes@cmpxchg.org/

It doesn't do any demotion/promotion; it just makes allocations
interleave across different tiers.

>
> Unless there is a real need, I am wondering if we can avoid sprinkling
> #ifdef CONFIG_MIGRATION in mm/memory-tiers.c
>
> -aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-09 14:22               ` Jonathan Cameron
@ 2022-06-09 20:41                 ` Johannes Weiner
  2022-06-10  6:15                   ` Ying Huang
  2022-06-10  9:57                   ` Jonathan Cameron
  0 siblings, 2 replies; 84+ messages in thread
From: Johannes Weiner @ 2022-06-09 20:41 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Aneesh Kumar K V, linux-mm, akpm, Wei Xu, Huang Ying,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote:
> I think discussion hinged on it making sense to be able to change
> rank of a tier rather than create a new tier and move things one by one.
> Example was wanting to change the rank of a tier that was created
> either by core code or a subsystem.
> 
> E.g. If GPU driver creates a tier, assumption is all similar GPUs will
> default to the same tier (if hot plugged later for example) as the
> driver subsystem will keep a reference to the created tier.
> Hence if user wants to change the order of that relative to
> other tiers, the option of creating a new tier and moving the
> devices would then require us to have infrastructure to tell the GPU
> driver to now use the new tier for additional devices.

That's an interesting point, thanks for explaining.

But that could still happen when two drivers report the same tier and
one of them is wrong, right? You'd still need to separate out by hand
to adjust rank, as well as handle hotplug events. Driver collisions
are probable with coarse categories like gpu, dram, pmem.

Would it make more sense to have the platform/devicetree/driver
provide more fine-grained distance values similar to NUMA distances,
and have a driver-scope tunable to override/correct? And then have the
distance value function as the unique tier ID and rank in one.

That would allow device class reassignments, too, and it would work
with driver collisions where simple "tier stickiness" would
not. (Although collisions would be less likely to begin with given a
broader range of possible distance values.)

Going further, it could be useful to separate the business of hardware
properties (and configuring quirks) from the business of configuring
MM policies that should be applied to the resulting tier hierarchy.
They're somewhat orthogonal tuning tasks, and one of them might become
obsolete before the other (if the quality of distance values provided
by drivers improves before the quality of MM heuristics ;). Separating
them might help clarify the interface for both designers and users.

E.g. a memdev class scope with a driver-wide distance value, and a
memdev scope for per-device values that default to "inherit driver
value". The memtier subtree would then have an r/o structure, but
allow tuning per-tier interleaving ratio[1], demotion rules etc.

[1] https://lore.kernel.org/linux-mm/20220607171949.85796-1-hannes@cmpxchg.org/#t
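
To make that concrete, a rough sketch of the proposed split; every
identifier here is invented for illustration, not an existing interface:

/* Per-driver scope: one default distance for the whole device class. */
struct memdev_class {
	int distance;
};

/* Per-device scope: an override that defaults to "inherit". */
struct memdev {
	struct memdev_class *class;
	int distance;		/* <0 means "inherit driver value" */
};

static int memdev_distance(const struct memdev *dev)
{
	return dev->distance >= 0 ? dev->distance : dev->class->distance;
}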

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-06-08 14:37                 ` Aneesh Kumar K.V
  2022-06-08 20:14                   ` Tim Chen
@ 2022-06-10  6:04                   ` Ying Huang
  1 sibling, 0 replies; 84+ messages in thread
From: Ying Huang @ 2022-06-10  6:04 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On Wed, 2022-06-08 at 20:07 +0530, Aneesh Kumar K.V wrote:
> Ying Huang <ying.huang@intel.com> writes:
> 
> ....
> 
> > > > 
> > > > > is this good (not tested)?
> > > > > /*
> > > > >    * build the allowed promotion mask. Promotion is allowed
> > > > >    * from higher memory tier to lower memory tier only if
> > > > >    * lower memory tier doesn't include compute. We want to
> > > > >    * skip promotion from a memory tier, if any node which is
> > > > >    * part of that memory tier have CPUs. Once we detect such
> > > > >    * a memory tier, we consider that tier as top tier from
> > > > >    * which promotion is not allowed.
> > > > >    */
> > > > > list_for_each_entry_reverse(memtier, &memory_tiers, list) {
> > > > > 	nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
> > > > > 	if (nodes_empty(allowed))
> > > > > 		nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
> > > > > 	else
> > > > > 		break;
> > > > > }
> > > > > 
> > > > > and then
> > > > > 
> > > > > static inline bool node_is_toptier(int node)
> > > > > {
> > > > > 
> > > > > 	return !node_isset(node, promotion_mask);
> > > > > }
> > > > > 
> > > > 
> > > > This should work.  But it appears unnatural.  I don't think we
> > > > should keep adding more and more node masks to work around the
> > > > design decision that we cannot access memory tier information
> > > > directly.  All of this becomes simple and natural if we can access
> > > > memory tier information directly.
> > > > 
> > > 
> > > how do you derive whether a node is a top tier node if we have memtier
> > > details in pgdat?
> > 
> > pgdat -> memory tier -> rank
> > 
> > Then we can compare this rank with the fast memory rank.  The fast
> > memory rank can be calculated dynamically at appropriate places.
> 
> This is what I am testing now. We still need to closely audit the
> lock-free access to NODE_DATA()->memtier. For v6 I will keep this as a
> separate patch, and once we all agree that it is safe, I will fold it
> back.

Thanks for doing this.  We finally have a way to access memory_tier in
the hot path.

[snip]

> +/*
> + * Called with memory_tier_lock. Hence the device references cannot
> + * be dropped during this function.
> + */
> +static void memtier_node_clear(int node, struct memory_tier *memtier)
> +{
> +	pg_data_t *pgdat;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
> +
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	/*
> +	 * Make sure read side see the NULL value before we clear the node
> +	 * from the nodelist.
> +	 */
> +	synchronize_rcu();
> +	node_clear(node, memtier->nodelist);
> +}
> +
> +static void memtier_node_set(int node, struct memory_tier *memtier)
> +{
> +	pg_data_t *pgdat;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
> +	/*
> +	 * Make sure we mark the memtier NULL before we assign the new memory tier
> +	 * to the NUMA node. This makes sure that anybody looking at NODE_DATA
> +	 * finds a NULL memtier or one which is still valid.
> +	 */
> 
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	synchronize_rcu();

Per my understanding, in your code, when we change pgdat->memtier, we
will call synchronize_rcu() twice.  IMHO, once should be OK.  That is,
something like below,

	rcu_assign_pointer(pgdat->memtier, NULL);
	node_clear(node, memtier->nodelist);
	synchronize_rcu();
	node_set(node, new_memtier->nodelist);
	rcu_assign_pointer(pgdat->memtier, new_memtier);

In this way, there will be 3 states,

1. prev

pgdat->memtier == old_memtier
node_isset(node, old_memtier->node_list)
!node_isset(node, new_memtier->node_list)

2. transitioning

pgdat->memtier == NULL
!node_isset(node, old_memtier->node_list)
!node_isset(node, new_memtier->node_list)

3. after

pgdat->memtier == new_memtier
!node_isset(node, old_memtier->node_list)
node_isset(node, new_memtier->node_list)

The real state may be one of 1, 2, 3, 1+2, or 2+3.  But it will not be
1+3.  I think that this satisfies our requirements.
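
Folded into a single helper (the name here is hypothetical), the
sequence above would be:

/*
 * Hypothetical helper combining the sequence above: move a node from
 * old_memtier to new_memtier with a single synchronize_rcu().
 * Readers see NULL or a tier whose nodelist contains the node, never
 * the inconsistent 1+3 state. Assumes memory_tier_lock is held.
 */
static void memtier_node_move(int node, struct memory_tier *old_memtier,
			      struct memory_tier *new_memtier)
{
	pg_data_t *pgdat = NODE_DATA(node);

	if (!pgdat)
		return;

	rcu_assign_pointer(pgdat->memtier, NULL);
	node_clear(node, old_memtier->nodelist);
	synchronize_rcu();
	node_set(node, new_memtier->nodelist);
	rcu_assign_pointer(pgdat->memtier, new_memtier);
}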

[snip]

>  static int __node_create_and_set_memory_tier(int node, int tier)
>  {
>  	int ret = 0;
> @@ -253,7 +318,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
>  			goto out;
>  		}
>  	}
> -	node_set(node, memtier->nodelist);
> +	memtier_node_set(node, memtier);
>  out:
>  	return ret;
>  }
> @@ -275,12 +340,12 @@ int node_create_and_set_memory_tier(int node, int tier)
>  	if (current_tier->dev.id == tier)
>  		goto out;
> -	node_clear(node, current_tier->nodelist);
> +	memtier_node_clear(node, current_tier);
> 
> +	node_set(node, memtier->nodelist);
> +	rcu_assign_pointer(pgdat->memtier, memtier);
> +}
> +
> +bool node_is_toptier(int node)
> +{
> +	bool toptier;
> +	pg_data_t *pgdat;
> +	struct memory_tier *memtier;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return false;
> +
> +	rcu_read_lock();
> +	memtier = rcu_dereference(pgdat->memtier);
> +	if (!memtier) {
> +		toptier = true;
> +		goto out;
> +	}
> +	if (memtier->rank >= top_tier_rank)
> +		toptier = true;
> +	else
> +		toptier = false;
> +out:
> +	rcu_read_unlock();
> +	return toptier;
> +}
> +
> 
>   	ret = __node_create_and_set_memory_tier(node, tier);
> 
>  	if (ret) {
>  		/* reset it back to older tier */
> -		node_set(node, current_tier->nodelist);
> +		memtier_node_set(node, current_tier);
>  		goto out;
>  	}
>  
> 

[snip]

>  static int __init memory_tier_init(void)
>  {
> -	int ret;
> +	int ret, node;
>  	struct memory_tier *memtier;
>
>  	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> 
> @@ -766,7 +829,13 @@ static int __init memory_tier_init(void)
> 
>  		panic("%s() failed to register memory tier: %d\n", __func__, ret);
>
>  	/* CPU only nodes are not part of memory tiers. */
> 
> -	memtier->nodelist = node_states[N_MEMORY];
> +	for_each_node_state(node, N_MEMORY) {
> +		/*
> +		 * Should be safe to do this early in the boot.
> +		 */
> +		NODE_DATA(node)->memtier = memtier;

Not absolutely required.  But IMHO it's more consistent to use
rcu_assign_pointer() here.

> +		node_set(node, memtier->nodelist);
> +	}
>  	migrate_on_reclaim_init();
>  
>  	return 0;

Best Regareds,
Huang, Ying



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-09 20:41                 ` Johannes Weiner
@ 2022-06-10  6:15                   ` Ying Huang
  2022-06-10  9:57                   ` Jonathan Cameron
  1 sibling, 0 replies; 84+ messages in thread
From: Ying Huang @ 2022-06-10  6:15 UTC (permalink / raw)
  To: Johannes Weiner, Jonathan Cameron
  Cc: Aneesh Kumar K V, linux-mm, akpm, Wei Xu, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

On Thu, 2022-06-09 at 16:41 -0400, Johannes Weiner wrote:
> On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote:
> > I think discussion hinged on it making sense to be able to change
> > rank of a tier rather than create a new tier and move things one by one.
> > Example was wanting to change the rank of a tier that was created
> > either by core code or a subsystem.
> > 
> > E.g. If GPU driver creates a tier, assumption is all similar GPUs will
> > default to the same tier (if hot plugged later for example) as the
> > driver subsystem will keep a reference to the created tier.
> > Hence if user wants to change the order of that relative to
> > other tiers, the option of creating a new tier and moving the
> > devices would then require us to have infrastructure to tell the GPU
> > driver to now use the new tier for additional devices.
> 
> That's an interesting point, thanks for explaining.

I have proposed using sparse memory tier device IDs and removing rank.
The response from Wei Xu is as follows,

"
Using the rank value directly as the device ID has some disadvantages:
- It is kind of unconventional to number devices in this way.
- We cannot assign DRAM nodes with CPUs with a specific memtier device
ID (even though this is not mandated by the "rank" proposal, I expect
the device will likely always be memtier1 in practice).
- It is possible that we may eventually allow the rank value to be
modified as a way to adjust the tier ordering.  We cannot do that
easily for device IDs.
"

in

https://lore.kernel.org/lkml/CAAPL-u9t=9hYfcXyCZwYFmVTUQGrWVq3cndpN+sqPSm5cwE4Yg@mail.gmail.com/

I think that your proposal below has resolved the latter "disadvantage".
So if the former one isn't so important, we can go ahead and remove "rank".
That will make memory tiers much easier to understand and use.

Best Regards,
Huang, Ying

> But that could still happen when two drivers report the same tier and
> one of them is wrong, right? You'd still need to separate out by hand
> to adjust rank, as well as handle hotplug events. Driver collisions
> are probable with coarse categories like gpu, dram, pmem.
> 
> Would it make more sense to have the platform/devicetree/driver
> provide more fine-grained distance values similar to NUMA distances,
> and have a driver-scope tunable to override/correct? And then have the
> distance value function as the unique tier ID and rank in one.
> 
> That would allow device class reassignments, too, and it would work
> with driver collisions where simple "tier stickiness" would
> not. (Although collisions would be less likely to begin with given a
> broader range of possible distance values.)
> 
> Going further, it could be useful to separate the business of hardware
> properties (and configuring quirks) from the business of configuring
> MM policies that should be applied to the resulting tier hierarchy.
> They're somewhat orthogonal tuning tasks, and one of them might become
> obsolete before the other (if the quality of distance values provided
> by drivers improves before the quality of MM heuristics ;). Separating
> them might help clarify the interface for both designers and users.
> 
> E.g. a memdev class scope with a driver-wide distance value, and a
> memdev scope for per-device values that default to "inherit driver
> value". The memtier subtree would then have an r/o structure, but
> allow tuning per-tier interleaving ratio[1], demotion rules etc.
> 
> [1] https://lore.kernel.org/linux-mm/20220607171949.85796-1-hannes@cmpxchg.org/#t



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-09 20:41                 ` Johannes Weiner
  2022-06-10  6:15                   ` Ying Huang
@ 2022-06-10  9:57                   ` Jonathan Cameron
  2022-06-13 14:05                     ` Johannes Weiner
  1 sibling, 1 reply; 84+ messages in thread
From: Jonathan Cameron @ 2022-06-10  9:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Aneesh Kumar K V, linux-mm, akpm, Wei Xu, Huang Ying,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Thu, 9 Jun 2022 16:41:04 -0400
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote:
> > I think discussion hinged on it making sense to be able to change
> > rank of a tier rather than create a new tier and move things one by one.
> > Example was wanting to change the rank of a tier that was created
> > either by core code or a subsystem.
> > 
> > E.g. If GPU driver creates a tier, assumption is all similar GPUs will
> > default to the same tier (if hot plugged later for example) as the
> > driver subsystem will keep a reference to the created tier.
> > Hence if user wants to change the order of that relative to
> > other tiers, the option of creating a new tier and moving the
> > devices would then require us to have infrastructure to tell the GPU
> > driver to now use the new tier for additional devices.  
> 
> That's an interesting point, thanks for explaining.
> 
> But that could still happen when two drivers report the same tier and
> one of them is wrong, right? You'd still need to separate out by hand
> > to adjust rank, as well as handle hotplug events. Driver collisions
> are probable with coarse categories like gpu, dram, pmem.

There will always be cases that need hand tweaking.  Also I'd envision
some driver subsystems being clever enough to manage several tiers and
use the information available to them to assign appropriately.  This
is definitely true for CXL 2.0+ devices where we can have radically
different device types under the same driver (volatile, persistent,
direct connect, behind switches etc).  There will be some interesting
choices to make on groupings in big systems as we don't want too many
tiers unless we naturally demote multiple levels in one go.

> 
> Would it make more sense to have the platform/devicetree/driver
> provide more fine-grained distance values similar to NUMA distances,
> and have a driver-scope tunable to override/correct? And then have the
> distance value function as the unique tier ID and rank in one.

Absolutely a good thing to provide that information, but it's black
magic. There are too many contradicting metrics (latency vs bandwidth etc),
even before including a more complex system model like the one Jerome Glisse
proposed a few years back. https://lore.kernel.org/all/20190118174512.GA3060@redhat.com/
CXL 2.0 got this more right than anything else I've seen as it provides
discoverable topology along with details like latency to cross between
particular switch ports.  Actually using that data (other than by throwing
it to userspace controls for HPC apps etc) is going to take some figuring out.
Even the question of what and how we expose this info to userspace is
non-obvious.

The 'right' decision is also use-case specific, so what you'd do for
particular memory characteristics on a GPU is not the same as what
you'd do for the same characteristics on a memory-only device.

> 
> That would allow device class reassignments, too, and it would work
> with driver collisions where simple "tier stickiness" would
> not. (Although collisions would be less likely to begin with given a
> broader range of possible distance values.)

I think we definitely need the option to move individual nodes (in this
case nodes map to individual devices if characteristics vary between them)
around as well, but I think that's somewhat orthogonal to a good first guess.

> 
> Going further, it could be useful to separate the business of hardware
> properties (and configuring quirks) from the business of configuring
> MM policies that should be applied to the resulting tier hierarchy.
> They're somewhat orthogonal tuning tasks, and one of them might become
> obsolete before the other (if the quality of distance values provided
> by drivers improves before the quality of MM heuristics ;). Separating
> them might help clarify the interface for both designers and users.
> 
> E.g. a memdev class scope with a driver-wide distance value, and a
> memdev scope for per-device values that default to "inherit driver
> value". The memtier subtree would then have an r/o structure, but
> allow tuning per-tier interleaving ratio[1], demotion rules etc.

Ok that makes sense.  I'm not sure if that ends up as an implementation
detail, or affects the userspace interface of this particular element.

I'm not sure completely read-only is flexible enough (though mostly RO is fine)
as we keep sketching out cases where any attempt to do things automatically
does the wrong thing and where we need to add an extra tier to get
everything to work.  Short of having a lot of tiers I'm not sure how
we could have the default work well.  Maybe a lot of "tiers" is fine,
though perhaps we need to rename them if we go this way, since they then
don't really match the current concept of a tier.

Imagine a system with subtle differences between different memories, such
as a 10% latency increase for the same bandwidth.  To get an advantage from
demoting to such a tier will require really stable usage and long
run times. Whilst you could design a demotion scheme that takes that
into account, I think we are a long way from that today.

Jonathan


> 
> [1] https://lore.kernel.org/linux-mm/20220607171949.85796-1-hannes@cmpxchg.org/#t


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-10  9:57                   ` Jonathan Cameron
@ 2022-06-13 14:05                     ` Johannes Weiner
  2022-06-13 14:23                       ` Aneesh Kumar K V
  2022-06-14 16:45                       ` Jonathan Cameron
  0 siblings, 2 replies; 84+ messages in thread
From: Johannes Weiner @ 2022-06-13 14:05 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Aneesh Kumar K V, linux-mm, akpm, Wei Xu, Huang Ying,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
> On Thu, 9 Jun 2022 16:41:04 -0400
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote:
> > Would it make more sense to have the platform/devicetree/driver
> > provide more fine-grained distance values similar to NUMA distances,
> > and have a driver-scope tunable to override/correct? And then have the
> > distance value function as the unique tier ID and rank in one.
> 
> Absolutely a good thing to provide that information, but it's black
> magic. There are too many contradicting metrics (latency vs bandwidth etc)
> even not including a more complex system model like Jerome Glisse proposed
> a few years back. https://lore.kernel.org/all/20190118174512.GA3060@redhat.com/
> CXL 2.0 got this more right than anything else I've seen as provides
> discoverable topology along with details like latency to cross between
> particular switch ports.  Actually using that data (other than by throwing
> it to userspace controls for HPC apps etc) is going to take some figuring out.
> Even the question of what + how we expose this info to userspace is non
> obvious.

Right, I don't think those would be scientifically accurate - but
neither is a number between 1 and 3. The way I look at it is more
about spreading out the address space a bit, to allow expressing
nuanced differences without risking conflicts and overlaps. Hopefully
this results in the shipped values stabilizing over time and thus
requiring less and less intervention and overriding from userspace.

> > Going further, it could be useful to separate the business of hardware
> > properties (and configuring quirks) from the business of configuring
> > MM policies that should be applied to the resulting tier hierarchy.
> > They're somewhat orthogonal tuning tasks, and one of them might become
> > obsolete before the other (if the quality of distance values provided
> > by drivers improves before the quality of MM heuristics ;). Separating
> > them might help clarify the interface for both designers and users.
> > 
> > E.g. a memdev class scope with a driver-wide distance value, and a
> > memdev scope for per-device values that default to "inherit driver
> > value". The memtier subtree would then have an r/o structure, but
> > allow tuning per-tier interleaving ratio[1], demotion rules etc.
> 
> Ok that makes sense.  I'm not sure if that ends up as an implementation
> detail, or affects the userspace interface of this particular element.
> 
> I'm not sure completely read only is flexible enough (though mostly RO is fine)
> as we keep sketching out cases where any attempt to do things automatically
> does the wrong thing and where we need to add an extra tier to get
> everything to work.  Short of having a lot of tiers I'm not sure how
> we could have the default work well.  Maybe a lot of "tiers" is fine
> though perhaps we need to rename them if going this way and then they
> don't really work as current concept of tier.
> 
> Imagine a system with subtle difference between different memories such
> as 10% latency increase for same bandwidth.  To get an advantage from
> demoting to such a tier will require really stable usage and long
> run times. Whilst you could design a demotion scheme that takes that
> into account, I think we are a long way from that today.

Good point: there can be a clear hardware difference, but it's a
policy choice whether the MM should treat them as one or two tiers.

What do you think of a per-driver/per-device (overridable) distance
number, combined with a configurable distance cutoff for what
constitutes separate tiers. E.g. cutoff=20 means two devices with
distances of 10 and 20 respectively would be in the same tier, devices
with 10 and 100 would be in separate ones. The kernel then generates
and populates the tiers based on distances and grouping cutoff, and
populates the memtier directory tree and nodemasks in sysfs.

It could be simple tier0, tier1, tier2 numbering again, but the
numbers now would mean something to the user. A rank tunable is no
longer necessary.

I think even the nodemasks in the memtier tree could be read-only
then, since corrections should only be necessary when either the
device distance or the tier grouping cutoff is wrong.

Can you think of scenarios where that scheme would fall apart?
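For illustration only, here is a tiny userspace sketch of that grouping
rule, under the assumption that the cutoff bounds the gap between
adjacent sorted distance values (all names and numbers are invented):

#include <stdio.h>
#include <stdlib.h>

struct memdev { const char *name; unsigned long distance; };

static int cmp_dist(const void *a, const void *b)
{
	const struct memdev *x = a, *y = b;
	return (x->distance > y->distance) - (x->distance < y->distance);
}

int main(void)
{
	struct memdev devs[] = {
		{ "gpu-hbm", 10 }, { "dram", 20 },
		{ "cxl-dram", 40 }, { "pmem", 100 },
	};
	int n = sizeof(devs) / sizeof(devs[0]), tier = 0;
	unsigned long cutoff = 20;	/* grouping cutoff */

	qsort(devs, n, sizeof(devs[0]), cmp_dist);
	for (int i = 0; i < n; i++) {
		/* start a new tier when the gap to the previous
		 * device's distance exceeds the cutoff */
		if (i && devs[i].distance - devs[i - 1].distance > cutoff)
			tier++;
		printf("tier%d: %s (distance %lu)\n", tier,
		       devs[i].name, devs[i].distance);
	}
	return 0;
}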

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-13 14:05                     ` Johannes Weiner
@ 2022-06-13 14:23                       ` Aneesh Kumar K V
  2022-06-13 15:50                         ` Johannes Weiner
  2022-06-14 16:45                       ` Jonathan Cameron
  1 sibling, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-13 14:23 UTC (permalink / raw)
  To: Johannes Weiner, Jonathan Cameron
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

On 6/13/22 7:35 PM, Johannes Weiner wrote:
> On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
>>

....

>> I'm not sure completely read only is flexible enough (though mostly RO is fine)
>> as we keep sketching out cases where any attempt to do things automatically
>> does the wrong thing and where we need to add an extra tier to get
>> everything to work.  Short of having a lot of tiers I'm not sure how
>> we could have the default work well.  Maybe a lot of "tiers" is fine
>> though perhaps we need to rename them if going this way and then they
>> don't really work as current concept of tier.
>>
>> Imagine a system with subtle difference between different memories such
>> as 10% latency increase for same bandwidth.  To get an advantage from
>> demoting to such a tier will require really stable usage and long
>> run times. Whilst you could design a demotion scheme that takes that
>> into account, I think we are a long way from that today.
> 
> Good point: there can be a clear hardware difference, but it's a
> policy choice whether the MM should treat them as one or two tiers.
> 
> What do you think of a per-driver/per-device (overridable) distance
> number, combined with a configurable distance cutoff for what
> constitutes separate tiers. E.g. cutoff=20 means two devices with
> distances of 10 and 20 respectively would be in the same tier, devices
> with 10 and 100 would be in separate ones. The kernel then generates
> and populates the tiers based on distances and grouping cutoff, and
> populates the memtier directory tree and nodemasks in sysfs.
> 

Right now the core/generic code doesn't get involved in building tiers.
It just defines three tiers where drivers can place the respective
devices they manage. Wouldn't the above suggestion imply that we are
moving quite a lot of policy decision logic into the generic code?

At some point, we will have to depend on more attributes than
distance (maybe HMAT?), and each driver should have the flexibility to
place the device it is managing in a specific tier. By then we may
decide to support more than the 3 static tiers the core kernel
currently does.

If the kernel still can't make the right decision, userspace could
rearrange them in any order using rank values. Without something like
rank, if userspace needs to fix things up, it gets hard with device
hotplugging. I.e., the userspace policy could be that any new PMEM tier
device that is hotplugged gets parked with a very low rank value, and
hence lowest in the demotion order, by default
(echo 10 > /sys/devices/system/memtier/memtier2/rank). After that,
userspace could selectively move the new devices to the correct memory
tier?
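To make the rank semantics concrete, a small illustrative sketch (not
the series' code; the rank values are invented) of how tiers would be
ordered by rank rather than by tier number:

#include <stdio.h>
#include <stdlib.h>

struct memtier { int id; int rank; };

static int by_rank_desc(const void *a, const void *b)
{
	return ((const struct memtier *)b)->rank -
	       ((const struct memtier *)a)->rank;
}

int main(void)
{
	struct memtier tiers[] = {
		{ 0, 300 }, { 1, 200 }, { 2, 10 },	/* memtier2 parked low */
	};
	int n = sizeof(tiers) / sizeof(tiers[0]);

	qsort(tiers, n, sizeof(tiers[0]), by_rank_desc);
	for (int i = 0; i < n; i++)	/* demotion walks top to bottom */
		printf("memtier%d (rank %d)\n", tiers[i].id, tiers[i].rank);
	return 0;
}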


> It could be simple tier0, tier1, tier2 numbering again, but the
> numbers now would mean something to the user. A rank tunable is no
> longer necessary.
> 
> I think even the nodemasks in the memtier tree could be read-only
> then, since corrections should only be necessary when either the
> device distance is wrong or the tier grouping cutoff.
> 
> Can you think of scenarios where that scheme would fall apart?

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-13 14:23                       ` Aneesh Kumar K V
@ 2022-06-13 15:50                         ` Johannes Weiner
  2022-06-14  6:48                           ` Ying Huang
  2022-06-14  8:01                           ` Aneesh Kumar K V
  0 siblings, 2 replies; 84+ messages in thread
From: Johannes Weiner @ 2022-06-13 15:50 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Jonathan Cameron, linux-mm, akpm, Wei Xu, Huang Ying,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:
> On 6/13/22 7:35 PM, Johannes Weiner wrote:
> > On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
> > > I'm not sure completely read only is flexible enough (though mostly RO is fine)
> > > as we keep sketching out cases where any attempt to do things automatically
> > > does the wrong thing and where we need to add an extra tier to get
> > > everything to work.  Short of having a lot of tiers I'm not sure how
> > > we could have the default work well.  Maybe a lot of "tiers" is fine
> > > though perhaps we need to rename them if going this way and then they
> > > don't really work as current concept of tier.
> > > 
> > > Imagine a system with subtle difference between different memories such
> > > as 10% latency increase for same bandwidth.  To get an advantage from
> > > demoting to such a tier will require really stable usage and long
> > > run times. Whilst you could design a demotion scheme that takes that
> > > into account, I think we are a long way from that today.
> > 
> > Good point: there can be a clear hardware difference, but it's a
> > policy choice whether the MM should treat them as one or two tiers.
> > 
> > What do you think of a per-driver/per-device (overridable) distance
> > number, combined with a configurable distance cutoff for what
> > constitutes separate tiers. E.g. cutoff=20 means two devices with
> > distances of 10 and 20 respectively would be in the same tier, devices
> > with 10 and 100 would be in separate ones. The kernel then generates
> > and populates the tiers based on distances and grouping cutoff, and
> > populates the memtier directory tree and nodemasks in sysfs.
> > 
> 
> Right now core/generic code doesn't get involved in building tiers. It just
> defines three tiers where drivers could place the respective devices they
> manage. The above suggestion would imply we are moving quite a lot of policy
> decision logic into the generic code?

No. The driver still chooses its own number, just from a wider
range. The only policy in generic code is the distance cutoff for
which devices are grouped into tiers together.

> At some point, we will have to depend on more attributes other than
> distance(may be HMAT?) and each driver should have the flexibility to place
> the device it is managing in a specific tier? By then we may decide to
> support more than 3 static tiers which the core kernel currently does.

If we start with a larger possible range of "distance" values right
away, we can still let the drivers ballpark into 3 tiers for now (100,
200, 300). But it will be easier to take additional metrics into
account later and fine tune accordingly (120, 260, 90 etc.) without
having to update all the other drivers as well.
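As a sketch, the ballpark defaults could be as simple as (names
invented for illustration):

/* Illustrative only: ballpark driver defaults on a wider distance scale. */
#define MEMTIER_DIST_GPU_HBM	100
#define MEMTIER_DIST_DRAM	200
#define MEMTIER_DIST_PMEM	300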

> If the kernel still can't make the right decision, userspace could rearrange
> them in any order using rank values. Without something like rank, if
> userspace needs to fix things up,  it gets hard with device
> hotplugging. ie, the userspace policy could be that any new PMEM tier device
> that is hotplugged, park it with a very low-rank value and hence lowest in
> demotion order by default. (echo 10 >
> /sys/devices/system/memtier/memtier2/rank) . After that userspace could
> selectively move the new devices to the correct memory tier?

I had touched on this in the other email.

This doesn't work if two drivers that should have separate policies
collide into the same tier - which is very likely with just 3 tiers.
So it seems to me the main use case for having a rank tunable falls
apart rather quickly until tiers are spaced out more widely. And it
does so at the cost of an, IMO, tricky-to-understand interface.

In the other email I had suggested the ability to override not just
the per-device distance, but also the driver default for new devices
to handle the hotplug situation.

This should be less policy than before. Driver default and per-device
distances (both overridable) combined with one tunable to set the
range of distances that get grouped into tiers.

With these parameters alone, you can generate an ordered list of tiers
and their devices. The tier numbers make sense, and no rank is needed.

Do you still need the ability to move nodes by writing nodemasks? I
don't think so. Assuming you would never want to have an actually
slower device in a higher tier than a faster device, the only time
you'd want to move a device is when the device's distance value is
wrong. So you override that (until you update to a fixed kernel).

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-13 15:50                         ` Johannes Weiner
@ 2022-06-14  6:48                           ` Ying Huang
  2022-06-14  8:01                           ` Aneesh Kumar K V
  1 sibling, 0 replies; 84+ messages in thread
From: Ying Huang @ 2022-06-14  6:48 UTC (permalink / raw)
  To: Johannes Weiner, Aneesh Kumar K V
  Cc: Jonathan Cameron, linux-mm, akpm, Wei Xu, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

On Mon, 2022-06-13 at 11:50 -0400, Johannes Weiner wrote:
> On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:
> > On 6/13/22 7:35 PM, Johannes Weiner wrote:
> > > On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
> > > > I'm not sure completely read only is flexible enough (though mostly RO is fine)
> > > > as we keep sketching out cases where any attempt to do things automatically
> > > > does the wrong thing and where we need to add an extra tier to get
> > > > everything to work.  Short of having a lot of tiers I'm not sure how
> > > > we could have the default work well.  Maybe a lot of "tiers" is fine
> > > > though perhaps we need to rename them if going this way and then they
> > > > don't really work as current concept of tier.
> > > > 
> > > > Imagine a system with subtle difference between different memories such
> > > > as 10% latency increase for same bandwidth.  To get an advantage from
> > > > demoting to such a tier will require really stable usage and long
> > > > run times. Whilst you could design a demotion scheme that takes that
> > > > into account, I think we are a long way from that today.
> > > 
> > > Good point: there can be a clear hardware difference, but it's a
> > > policy choice whether the MM should treat them as one or two tiers.
> > > 
> > > What do you think of a per-driver/per-device (overridable) distance
> > > number, combined with a configurable distance cutoff for what
> > > constitutes separate tiers. E.g. cutoff=20 means two devices with
> > > distances of 10 and 20 respectively would be in the same tier, devices
> > > with 10 and 100 would be in separate ones. The kernel then generates
> > > and populates the tiers based on distances and grouping cutoff, and
> > > populates the memtier directory tree and nodemasks in sysfs.
> > > 
> > 
> > Right now core/generic code doesn't get involved in building tiers. It just
> > defines three tiers where drivers could place the respective devices they
> > manage. The above suggestion would imply we are moving quite a lot of policy
> > decision logic into the generic code?
> 
> No. The driver still chooses its own number, just from a wider
> range. The only policy in generic code is the distance cutoff for
> which devices are grouped into tiers together.
> 
> > At some point, we will have to depend on more attributes other than
> > distance(may be HMAT?) and each driver should have the flexibility to place
> > the device it is managing in a specific tier? By then we may decide to
> > support more than 3 static tiers which the core kernel currently does.
> 
> If we start with a larger possible range of "distance" values right
> away, we can still let the drivers ballpark into 3 tiers for now (100,
> 200, 300). But it will be easier to take additional metrics into
> account later and fine tune accordingly (120, 260, 90 etc.) without
> having to update all the other drivers as well.
> 
> > If the kernel still can't make the right decision, userspace could rearrange
> > them in any order using rank values. Without something like rank, if
> > userspace needs to fix things up,  it gets hard with device
> > hotplugging. ie, the userspace policy could be that any new PMEM tier device
> > that is hotplugged, park it with a very low-rank value and hence lowest in
> > demotion order by default. (echo 10 >
> > /sys/devices/system/memtier/memtier2/rank) . After that userspace could
> > selectively move the new devices to the correct memory tier?
> 
> I had touched on this in the other email.
> 
> This doesn't work if two drivers that should have separate policies
> collide into the same tier - which is very likely with just 3 tiers.
> So it seems to me the main usecase for having a rank tunable falls
> apart rather quickly until tiers are spaced out more widely. And it
> does so at the cost of an, IMO, tricky to understand interface.
> 
> In the other email I had suggested the ability to override not just
> the per-device distance, but also the driver default for new devices
> to handle the hotplug situation.
> 
> This should be less policy than before. Driver default and per-device
> distances (both overridable) combined with one tunable to set the
> range of distances that get grouped into tiers.
> 
> With these parameters alone, you can generate an ordered list of tiers
> and their devices. The tier numbers make sense, and no rank is needed.
> 
> Do you still need the ability to move nodes by writing nodemasks? I
> don't think so. Assuming you would never want to have an actually
> slower device in a higher tier than a faster device, the only time
> you'd want to move a device is when the device's distance value is
> wrong. So you override that (until you update to a fixed kernel).

This sounds good to me.  In this way, we override the driver parameters
instead of the memory tiers themselves.  So I guess when we do that, the
memory tier of the NUMA nodes controlled by the driver will be changed.
Or will all memory tiers be regenerated?

I have a suggestion.  Instead of an abstract distance number, how about
using memory latency and bandwidth directly?  These can be obtained from
HMAT directly when necessary.  Even if they are not available there,
they may be measured at runtime by the drivers.
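Purely as a sketch of the idea (the normalization and the weights below
are invented, not from HMAT or from anything proposed in this thread),
latency and bandwidth could be folded into a single ordering key:

#include <stdio.h>

int main(void)
{
	double dram_lat = 100.0, dram_bw = 100.0;	/* baseline: local DRAM */
	double dev_lat = 300.0, dev_bw = 30.0;		/* e.g. a PMEM-like device */

	/* higher key == slower memory; the 50/50 weighting is arbitrary */
	double key = 50.0 * (dev_lat / dram_lat + dram_bw / dev_bw);

	printf("ordering key: %.0f\n", key);		/* ~317 for these numbers */
	return 0;
}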

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-13 15:50                         ` Johannes Weiner
  2022-06-14  6:48                           ` Ying Huang
@ 2022-06-14  8:01                           ` Aneesh Kumar K V
  2022-06-14 18:56                             ` Johannes Weiner
  1 sibling, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-14  8:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jonathan Cameron, linux-mm, akpm, Wei Xu, Huang Ying,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/13/22 9:20 PM, Johannes Weiner wrote:
> On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:
>> On 6/13/22 7:35 PM, Johannes Weiner wrote:
>>> On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
>>>> I'm not sure completely read only is flexible enough (though mostly RO is fine)
>>>> as we keep sketching out cases where any attempt to do things automatically
>>>> does the wrong thing and where we need to add an extra tier to get
>>>> everything to work.  Short of having a lot of tiers I'm not sure how
>>>> we could have the default work well.  Maybe a lot of "tiers" is fine
>>>> though perhaps we need to rename them if going this way and then they
>>>> don't really work as current concept of tier.
>>>>
>>>> Imagine a system with subtle difference between different memories such
>>>> as 10% latency increase for same bandwidth.  To get an advantage from
>>>> demoting to such a tier will require really stable usage and long
>>>> run times. Whilst you could design a demotion scheme that takes that
>>>> into account, I think we are a long way from that today.
>>>
>>> Good point: there can be a clear hardware difference, but it's a
>>> policy choice whether the MM should treat them as one or two tiers.
>>>
>>> What do you think of a per-driver/per-device (overridable) distance
>>> number, combined with a configurable distance cutoff for what
>>> constitutes separate tiers. E.g. cutoff=20 means two devices with
>>> distances of 10 and 20 respectively would be in the same tier, devices
>>> with 10 and 100 would be in separate ones. The kernel then generates
>>> and populates the tiers based on distances and grouping cutoff, and
>>> populates the memtier directory tree and nodemasks in sysfs.
>>>
>>
>> Right now core/generic code doesn't get involved in building tiers. It just
>> defines three tiers where drivers could place the respective devices they
>> manage. The above suggestion would imply we are moving quite a lot of policy
>> decision logic into the generic code?
> 
> No. The driver still chooses its own number, just from a wider
> range. The only policy in generic code is the distance cutoff for
> which devices are grouped into tiers together.
> 
>> At some point, we will have to depend on more attributes other than
>> distance(may be HMAT?) and each driver should have the flexibility to place
>> the device it is managing in a specific tier? By then we may decide to
>> support more than 3 static tiers which the core kernel currently does.
> 
> If we start with a larger possible range of "distance" values right
> away, we can still let the drivers ballpark into 3 tiers for now (100,
> 200, 300). But it will be easier to take additional metrics into
> account later and fine tune accordingly (120, 260, 90 etc.) without
> having to update all the other drivers as well.
> 
>> If the kernel still can't make the right decision, userspace could rearrange
>> them in any order using rank values. Without something like rank, if
>> userspace needs to fix things up,  it gets hard with device
>> hotplugging. ie, the userspace policy could be that any new PMEM tier device
>> that is hotplugged, park it with a very low-rank value and hence lowest in
>> demotion order by default. (echo 10 >
>> /sys/devices/system/memtier/memtier2/rank) . After that userspace could
>> selectively move the new devices to the correct memory tier?
> 
> I had touched on this in the other email.
> 
> This doesn't work if two drivers that should have separate policies
> collide into the same tier - which is very likely with just 3 tiers.
> So it seems to me the main usecase for having a rank tunable falls
> apart rather quickly until tiers are spaced out more widely. And it
> does so at the cost of an, IMO, tricky to understand interface.
> 

Considering the kernel has a static map for these tiers, how can two drivers
end up using the same tier? If a new driver is going to manage a memory
device with different characteristics from the one managed by dax/kmem,
we will end up adding

#define MEMORY_TIER_NEW_DEVICE 4

The new driver will never use MEMORY_TIER_PMEM

What can happen is that two devices managed by DAX/kmem which
should be in two memory tiers get assigned the same memory tier,
because the dax/kmem driver added both devices to the same memory tier.

In the future we would avoid that by using more device properties like HMAT
to create additional memory tiers with different rank values, i.e., we would
do that in dax/kmem via create_tier_from_rank().


> In the other email I had suggested the ability to override not just
> the per-device distance, but also the driver default for new devices
> to handle the hotplug situation.
> 

I understand that the driver override will be done via module parameters.
How will we implement the device override? For example, in the case of the
dax/kmem driver, will the device override be per dax device? What interface
will we use to set the override?

IIUC in the above proposal the dax/kmem will do

node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax));

static int get_device_tier_index(struct dev_dax *dev)
{
	return dax_kmem_tier_index;	/* module parameter */
}

Are you suggesting adding a dev_dax property to override the tier defaults?

> This should be less policy than before. Driver default and per-device
> distances (both overridable) combined with one tunable to set the
> range of distances that get grouped into tiers.
> 

Can you elaborate more on how the distance value will be used? The
device's NUMA node can have a different distance value to each of the
other NUMA nodes. How do we group them? For example, the earlier
discussion outlined three different topologies. Can you elaborate on
how we would end up grouping them using distance?

For example: in the first topology below, node 2 is at distance 30 from
Node0 and 40 from Node1, so how will we classify node 2? (One possible
reading is sketched after the three topologies.)


Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.

		  20
  Node 0 (DRAM)  ----  Node 1 (DRAM)
       |        \   /       |
       | 30    40 X 40      | 30
       |        /   \       |
  Node 2 (PMEM)  ----  Node 3 (PMEM)
		  40

node distances:
node   0    1    2    3
   0  10   20   30   40
   1  20   10   40   30
   2  30   40   10   40
   3  40   30   40   10

Node 0 & 1 are DRAM nodes.
Node 2 is a PMEM node and closer to node 0.

		  20
  Node 0 (DRAM)  ----  Node 1 (DRAM)
       |            /
       | 30       / 40
       |        /
  Node 2 (PMEM)

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   40
   2  30   40   10


Node 0 is a DRAM node with CPU.
Node 1 is a GPU node.
Node 2 is a PMEM node.
Node 3 is a large, slow DRAM node without CPU.

		    100
     Node 0 (DRAM)  ----  Node 1 (GPU)
    /     |               /    |
   /40    |30        120 /     | 110
  |       |             /      |
  |  Node 2 (PMEM) ----       /
  |        \                 /
   \     80 \               /
    ------- Node 3 (Slow DRAM)

node distances:
node    0    1    2    3
   0   10  100   30   40
   1  100   10  120  110
   2   30  120   10   80
   3   40  110   80   10
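One possible (hedged) reading for the first topology: derive a per-node
value as the minimum distance to any CPU node, then group with the
cutoff. A small illustrative sketch:

#include <stdio.h>

int main(void)
{
	int dist[4][4] = {
		{ 10, 20, 30, 40 },
		{ 20, 10, 40, 30 },
		{ 30, 40, 10, 40 },
		{ 40, 30, 40, 10 },
	};

	for (int n = 2; n <= 3; n++) {	/* nodes 2 and 3 are the PMEM nodes */
		int min = dist[n][0] < dist[n][1] ? dist[n][0] : dist[n][1];
		printf("node%d: min distance to a CPU node = %d\n", n, min);
	}
	return 0;
}

Both PMEM nodes come out at 30 here, so node 2 and node 3 would land in
the same tier under that reading - which may or may not be what we want.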

> With these parameters alone, you can generate an ordered list of tiers
> and their devices. The tier numbers make sense, and no rank is needed.
> 
> Do you still need the ability to move nodes by writing nodemasks? I
> don't think so. Assuming you would never want to have an actually
> slower device in a higher tier than a faster device, the only time
> you'd want to move a device is when the device's distance value is
> wrong. So you override that (until you update to a fixed kernel).


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-13 14:05                     ` Johannes Weiner
  2022-06-13 14:23                       ` Aneesh Kumar K V
@ 2022-06-14 16:45                       ` Jonathan Cameron
  2022-06-21  8:27                         ` Aneesh Kumar K V
  1 sibling, 1 reply; 84+ messages in thread
From: Jonathan Cameron @ 2022-06-14 16:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Aneesh Kumar K V, linux-mm, akpm, Wei Xu, Huang Ying,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Mon, 13 Jun 2022 10:05:06 -0400
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
> > On Thu, 9 Jun 2022 16:41:04 -0400
> > Johannes Weiner <hannes@cmpxchg.org> wrote:  
> > > On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote:
> > > Would it make more sense to have the platform/devicetree/driver
> > > provide more fine-grained distance values similar to NUMA distances,
> > > and have a driver-scope tunable to override/correct? And then have the
> > > distance value function as the unique tier ID and rank in one.  
> > 
> > Absolutely a good thing to provide that information, but it's black
> > magic. There are too many contradicting metrics (latency vs bandwidth etc)
> > even not including a more complex system model like Jerome Glisse proposed
> > a few years back. https://lore.kernel.org/all/20190118174512.GA3060@redhat.com/
> > CXL 2.0 got this more right than anything else I've seen, as it provides
> > discoverable topology along with details like latency to cross between
> > particular switch ports.  Actually using that data (other than by throwing
> > it to userspace controls for HPC apps etc) is going to take some figuring out.
> > Even the question of what + how we expose this info to userspace is
> > non-obvious.

Was offline for a few days.  At risk of splitting a complex thread
even more....

> 
> Right, I don't think those would be scientifically accurate - but
> neither is a number between 1 and 3.

The 3 tiers in this proposal are just a starting point (and one I'd
expect we'll move beyond very quickly) - the aim is to define a userspace
interface that is flexible enough, but then only use a tiny bit of that
flexibility to get an initial version in place.  Even relatively trivial
CXL systems will include:

1) Direct connected volatile memory (similar to a memory-only NUMA node / socket)
2) Direct connected non volatile (similar to a pmem NUMA node, but maybe not
   similar enough to fuse with socket connected pmem)
3) Switch connected volatile memory (typically a disaggregated memory device,
   so huge, high bandwidth, not great latency)
4) Switch connected non volatile (typically huge, high bandwidth, even worse
   latency).
5) Much more fun if we care about bandwidth, with interleaving going on
   in hardware across either similar or mixed sets of switch connected
   and direct connected devices.

Sure, we might fuse some of those.  But the CXL driver alone is likely to have
groups separate enough that we want to handle them as 4 tiers and migrate
between those tiers...  Obviously we might want a clever strategy for cold /
hot migration!

> The way I look at it is more
> about spreading out the address space a bit, to allow expressing
> nuanced differences without risking conflicts and overlaps. Hopefully
> this results in the shipped values stabilizing over time and thus
> requiring less and less intervention and overriding from userspace.

I don't think they ever will stabilize, because the right answer isn't
definable in terms of just one number.  We'll end up with the old mess of
magic values in SLIT, in which systems have been tuned against particular
use cases. HMAT was meant to solve that, but it's not yet clear it will.

> 
> > > Going further, it could be useful to separate the business of hardware
> > > properties (and configuring quirks) from the business of configuring
> > > MM policies that should be applied to the resulting tier hierarchy.
> > > They're somewhat orthogonal tuning tasks, and one of them might become
> > > obsolete before the other (if the quality of distance values provided
> > > by drivers improves before the quality of MM heuristics ;). Separating
> > > them might help clarify the interface for both designers and users.
> > > 
> > > E.g. a memdev class scope with a driver-wide distance value, and a
> > > memdev scope for per-device values that default to "inherit driver
> > > value". The memtier subtree would then have an r/o structure, but
> > > allow tuning per-tier interleaving ratio[1], demotion rules etc.  
> > 
> > Ok that makes sense.  I'm not sure if that ends up as an implementation
> > detail, or affects the userspace interface of this particular element.
> > 
> > I'm not sure completely read only is flexible enough (though mostly RO is fine)
> > as we keep sketching out cases where any attempt to do things automatically
> > does the wrong thing and where we need to add an extra tier to get
> > everything to work.  Short of having a lot of tiers I'm not sure how
> > we could have the default work well.  Maybe a lot of "tiers" is fine
> > though perhaps we need to rename them if going this way and then they
> > don't really work as current concept of tier.
> > 
> > Imagine a system with subtle difference between different memories such
> > as 10% latency increase for same bandwidth.  To get an advantage from
> > demoting to such a tier will require really stable usage and long
> > run times. Whilst you could design a demotion scheme that takes that
> > into account, I think we are a long way from that today.  
> 
> Good point: there can be a clear hardware difference, but it's a
> policy choice whether the MM should treat them as one or two tiers.
> 
> What do you think of a per-driver/per-device (overridable) distance
> number, combined with a configurable distance cutoff for what
> constitutes separate tiers. E.g. cutoff=20 means two devices with
> distances of 10 and 20 respectively would be in the same tier, devices
> with 10 and 100 would be in separate ones. The kernel then generates
> and populates the tiers based on distances and grouping cutoff, and
> populates the memtier directory tree and nodemasks in sysfs.

I think we'll need something along those lines, though I was envisioning
it sitting at the level of what we do with the tiers, rather than how
we create them.  So particular use cases would decide to treat
sets of tiers as if they were one.  Have enough tiers and we'll end up
with k-means or similar to figure out the groupings. Of course there
is then a sort of 'tier group for use XX' concept, so maybe not much
difference until we have a bunch of use cases.

> 
> It could be simple tier0, tier1, tier2 numbering again, but the
> numbers now would mean something to the user. A rank tunable is no
> longer necessary.

This feels like it might make tier assignments a bit less stable
and hence run into the question of how to hook up accounting. Not my
area of expertise, but it was put forward as one of the reasons
we didn't want hotplug to potentially end up shuffling other tiers
around.  The desire was for a 'stable' entity.  We can avoid that with
'space' between them, but then we sort of still have rank, just in a
form that makes updating it messy (we'd need to create a new tier to
do it).

> 
> I think even the nodemasks in the memtier tree could be read-only
> then, since corrections should only be necessary when either the
> device distance is wrong or the tier grouping cutoff.
> 
> Can you think of scenarios where that scheme would fall apart?

Simplest (I think) is the GPU one. Often those have very nice
memory that we CPU software developers would love to use, but
some pesky GPGPU folk think it is for GPU-related data. Anyhow, folk
who care about GPUs have requested that it be in a tier of
lower rank than main memory.

If you just categorize it by performance (from the CPUs) then it
might well end up elsewhere.  These folk do want to demote
to CPU-attached DRAM though.  Which raises the question of
'between which two points is your distance measured?'

Definitely a policy decision, and one we can't get from perf
characteristics.  It's a blurry line. There are classes
of fairly low-spec memory-attached accelerators on the horizon.
For those, preventing migration to the memory they are associated
with might generally not make sense.

Tweaking policy by messing with anything that claims to be a
distance is a bit nasty and looks like the SLIT table tuning
that still happens. We could have a per-device rank though,
and make it clear it isn't cleanly related to any perf
characteristics.  So ultimately that moves rank to devices,
and then we have to put them into nodes. Not sure it gains
us much other than seeming more complex to me.

Jonathan



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-14  8:01                           ` Aneesh Kumar K V
@ 2022-06-14 18:56                             ` Johannes Weiner
  2022-06-15  6:23                               ` Aneesh Kumar K V
  2022-06-16  1:11                               ` Ying Huang
  0 siblings, 2 replies; 84+ messages in thread
From: Johannes Weiner @ 2022-06-14 18:56 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Jonathan Cameron, linux-mm, akpm, Wei Xu, Huang Ying,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:
> On 6/13/22 9:20 PM, Johannes Weiner wrote:
> > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:
> >> If the kernel still can't make the right decision, userspace could rearrange
> >> them in any order using rank values. Without something like rank, if
> >> userspace needs to fix things up,  it gets hard with device
> >> hotplugging. ie, the userspace policy could be that any new PMEM tier device
> >> that is hotplugged, park it with a very low-rank value and hence lowest in
> >> demotion order by default. (echo 10 >
> >> /sys/devices/system/memtier/memtier2/rank) . After that userspace could
> >> selectively move the new devices to the correct memory tier?
> > 
> > I had touched on this in the other email.
> > 
> > This doesn't work if two drivers that should have separate policies
> > collide into the same tier - which is very likely with just 3 tiers.
> > So it seems to me the main usecase for having a rank tunable falls
> > apart rather quickly until tiers are spaced out more widely. And it
> > does so at the cost of an, IMO, tricky to understand interface.
> > 
> 
> Considering the kernel has a static map for these tiers, how can two drivers
> end up using the same tier? If a new driver is going to manage a memory
> device that is of different characteristics than the one managed by dax/kmem,
> we will end up adding 
> 
> #define MEMORY_TIER_NEW_DEVICE 4
> 
> The new driver will never use MEMORY_TIER_PMEM
> 
> What can happen is two devices that are managed by DAX/kmem that
> should be in two memory tiers get assigned the same memory tier
> because the dax/kmem driver added both devices to the same memory tier.
> 
> In the future we would avoid that by using more device properties like HMAT
> to create additional memory tiers with different rank values. ie, we would
> do in the dax/kmem create_tier_from_rank() .

Yes, that's the type of collision I mean. Two GPUs, two CXL-attached
DRAMs of different speeds etc.

I also like Huang's idea of using latency characteristics instead of
abstract distances. Though I'm not quite sure how feasible this is in
the short term, and share some concerns that Jonathan raised. But I
think a wider possible range to begin with makes sense in any case.

> > In the other email I had suggested the ability to override not just
> > the per-device distance, but also the driver default for new devices
> > to handle the hotplug situation.
> > 
> 
> I understand that the driver override will be done via module parameters.
> How will we implement device override? For example in case of dax/kmem driver
> the device override will be per dax device? What interface will we use to set the override?
> 
> IIUC in the above proposal the dax/kmem will do
> 
> node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax));
> 
> get_device_tier_index(struct dev_dax *dev)
> {
>     return dax_kmem_tier_index; // module parameter
> }
> 
> Are you suggesting to add a dev_dax property to override the tier defaults?

I was thinking of a new struct memdevice and struct memtype(?). Every
driver implementing memory devices like this sets those up and
registers them with generic code and preset parameters. The generic
code creates sysfs directories and allows overriding the parameters.

struct memdevice {
	struct device dev;
	unsigned long distance;
	struct list_head siblings;
	/* nid? ... */
};

struct memtype {
	struct device_type type;
	unsigned long default_distance;
	struct list_head devices;
};

That forms the (tweakable) tree describing physical properties.

From that, the kernel then generates the ordered list of tiers.
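For instance, a minimal sketch (assuming the structs above; the
DISTANCE_INHERIT sentinel is invented for illustration) of resolving a
device's effective distance:

#define DISTANCE_INHERIT	((unsigned long)-1)

static unsigned long effective_distance(struct memdevice *md,
					struct memtype *mt)
{
	/* a per-device value overrides the driver-wide default */
	if (md->distance != DISTANCE_INHERIT)
		return md->distance;
	return mt->default_distance;
}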

> > This should be less policy than before. Driver default and per-device
> > distances (both overridable) combined with one tunable to set the
> > range of distances that get grouped into tiers.
> > 
> 
> Can you elaborate more on how distance value will be used? The device/device NUMA node can have
> different distance value from other NUMA nodes. How do we group them?
> for ex: earlier discussion did outline three different topologies. Can you
> elaborate on how we would end up grouping them using distance?
> 
> For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Node1,
> so how will we classify node 2?
> 
> 
> Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> 
> 		  20
>   Node 0 (DRAM)  ----  Node 1 (DRAM)
>        |        \   /       |
>        | 30    40 X 40      | 30
>        |        /   \       |
>   Node 2 (PMEM)  ----  Node 3 (PMEM)
> 		  40
> 
> node distances:
> node   0    1    2    3
>    0  10   20   30   40
>    1  20   10   40   30
>    2  30   40   10   40
>    3  40   30   40   10

I'm fairly confused by this example. Do all nodes have CPUs? Isn't
this just classic NUMA, where optimizing for locality makes the most
sense, rather than tiering?

Forget the interface for a second, I have no idea how tiering on such
a system would work. One CPU's lower tier can be another CPU's
toptier. There is no lowest rung from which to actually *reclaim*
pages. Would the CPUs just demote in circles?

And the coldest pages on one socket would get demoted into another
socket and displace what that socket considers hot local memory?

I feel like I'm missing something.

When we're talking about tiered memory, I'm thinking about CPUs
utilizing more than one memory node. If those other nodes have CPUs,
you can't reliably establish a singular tier order anymore and it
becomes classic NUMA, no?

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-14 18:56                             ` Johannes Weiner
@ 2022-06-15  6:23                               ` Aneesh Kumar K V
  2022-06-16  1:11                               ` Ying Huang
  1 sibling, 0 replies; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-15  6:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jonathan Cameron, linux-mm, akpm, Wei Xu, Huang Ying,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/15/22 12:26 AM, Johannes Weiner wrote:

....

>> What can happen is two devices that are managed by DAX/kmem that
>> should be in two memory tiers get assigned the same memory tier
>> because the dax/kmem driver added both devices to the same memory tier.
>>
>> In the future we would avoid that by using more device properties like HMAT
>> to create additional memory tiers with different rank values. ie, we would
>> do in the dax/kmem create_tier_from_rank() .
> 
> Yes, that's the type of collision I mean. Two GPUs, two CXL-attached
> DRAMs of different speeds etc.
> 
> I also like Huang's idea of using latency characteristics instead of
> abstract distances. Though I'm not quite sure how feasible this is in
> the short term, and share some concerns that Jonathan raised. But I
> think a wider possible range to begin with makes sense in any case.
> 

How about the below proposal? 

In this proposal, we use the tier ID as the value that determines the position
of the memory tier in the demotion order. A higher value of tier ID indicates a
higher memory tier. Memory demotion happens from a higher memory tier to a lower
memory tier.

By default, memory gets hotplugged into 'default_memory_tier'. There is a core
kernel parameter "default_memory_tier" which can be updated if the user wants
to modify the default tier ID.

The dax/kmem driver uses the "dax_kmem_memtier" module parameter to determine
the memory tier to which DAX/kmem memory will be added.

dax_kmem_memtier and default_memory_tier default to 100 and 200, respectively.

Later, as we update dax/kmem to use additional device attributes, the driver
will be able to place new devices in different memory tiers. As we do that, it
is expected that users will have the ability to override these device
attributes and control which memory tiers the devices will be placed in.

New memory tiers can also be created by using the node/memtier attribute.
Moving a NUMA node to a non-existent memory tier results in creating a
new memory tier. So if the kernel's default placement of memory devices
in memory tiers is not preferred, userspace could choose to create a
completely new memory tier hierarchy using this interface. Memory tiers
get deleted when they end up with an empty nodelist.

# cat /sys/module/kernel/parameters/default_memory_tier 
200
# cat /sys/module/kmem/parameters/dax_kmem_memtier 
100

# ls /sys/devices/system/memtier/
default_tier  max_tier  memtier200  power  uevent
# ls /sys/devices/system/memtier/memtier200/nodelist 
/sys/devices/system/memtier/memtier200/nodelist
# cat  /sys/devices/system/memtier/memtier200/nodelist 
1-3
# echo 20 > /sys/devices/system/node/node1/memtier 
# 
# ls /sys/devices/system/memtier/
default_tier  max_tier  memtier20  memtier200  power  uevent
# cat  /sys/devices/system/memtier/memtier20/nodelist 
1
# 

# echo 10 > /sys/module/kmem/parameters/dax_kmem_memtier 
# echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind 
# echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id 
# 
# ls /sys/devices/system/memtier/
default_tier  max_tier  memtier10  memtier20  memtier200  power  uevent
# cat  /sys/devices/system/memtier/memtier10/nodelist 
4
# 

# grep . /sys/devices/system/memtier/memtier*/nodelist
/sys/devices/system/memtier/memtier10/nodelist:4
/sys/devices/system/memtier/memtier200/nodelist:2-3
/sys/devices/system/memtier/memtier20/nodelist:1

The demotion order details for the above will be:
lower tier mask for node 1 is 4 and the preferred demotion node is 4
lower tier mask for node 2 is 1,4 and the preferred demotion node is 1
lower tier mask for node 3 is 1,4 and the preferred demotion node is 1
lower tier mask for node 4 is None
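As an illustrative sketch (not the series' code), those masks follow
mechanically from the tier IDs above:

#include <stdio.h>

int main(void)
{
	/* node -> tier ID from the session above: 1->20, 2,3->200, 4->10 */
	int tier_of[] = { [1] = 20, [2] = 200, [3] = 200, [4] = 10 };

	for (int n = 1; n <= 4; n++) {
		printf("node%d lower tier mask:", n);
		for (int m = 1; m <= 4; m++)
			if (tier_of[m] < tier_of[n])
				printf(" %d", m);
		printf("\n");
	}
	return 0;
}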

:/sys/devices/system/memtier#  ls
default_tier  max_tier  memtier10  memtier20  memtier200  power  uevent
:/sys/devices/system/memtier#  cat memtier20/nodelist 
1
:/sys/devices/system/memtier# echo 200 > ../node/node1/memtier 
:/sys/devices/system/memtier# ls
default_tier  max_tier  memtier10  memtier200  power  uevent
:/sys/devices/system/memtier# 




>>> In the other email I had suggested the ability to override not just
>>> the per-device distance, but also the driver default for new devices
>>> to handle the hotplug situation.
>>>

.....

>>
>> Can you elaborate more on how distance value will be used? The device/device NUMA node can have
>> different distance value from other NUMA nodes. How do we group them?
>> for ex: earlier discussion did outline three different topologies. Can you
>> elaborate on how we would end up grouping them using distance?
>>
>> For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Node1,
>> so how will we classify node 2?
>>
>>
>> Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>>
>> 		  20
>>   Node 0 (DRAM)  ----  Node 1 (DRAM)
>>        |        \   /       |
>>        | 30    40 X 40      | 30
>>        |        /   \       |
>>   Node 2 (PMEM)  ----  Node 3 (PMEM)
>> 		  40
>>
>> node distances:
>> node   0    1    2    3
>>    0  10   20   30   40
>>    1  20   10   40   30
>>    2  30   40   10   40
>>    3  40   30   40   10
> 
> I'm fairly confused by this example. Do all nodes have CPUs? Isn't
> this just classic NUMA, where optimizing for locality makes the most
> sense, rather than tiering?
> 

Node 2 and Node 3 will be memory-only NUMA nodes.

> Forget the interface for a second, I have no idea how tiering on such
> a system would work. One CPU's lower tier can be another CPU's
> toptier. There is no lowest rung from which to actually *reclaim*
> pages. Would the CPUs just demote in circles?
> 
> And the coldest pages on one socket would get demoted into another
> socket and displace what that socket considers hot local memory?
> 
> I feel like I'm missing something.
> 
> When we're talking about tiered memory, I'm thinking about CPUs
> utilizing more than one memory node. If those other nodes have CPUs,
> you can't reliably establish a singular tier order anymore and it
> becomes classic NUMA, no?


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-14 18:56                             ` Johannes Weiner
  2022-06-15  6:23                               ` Aneesh Kumar K V
@ 2022-06-16  1:11                               ` Ying Huang
  2022-06-16  3:45                                 ` Wei Xu
  2022-06-17 10:41                                 ` Jonathan Cameron
  1 sibling, 2 replies; 84+ messages in thread
From: Ying Huang @ 2022-06-16  1:11 UTC (permalink / raw)
  To: Johannes Weiner, Aneesh Kumar K V
  Cc: Jonathan Cameron, linux-mm, akpm, Wei Xu, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote:
> On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:
> > On 6/13/22 9:20 PM, Johannes Weiner wrote:
> > > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:
> > > > If the kernel still can't make the right decision, userspace could rearrange
> > > > them in any order using rank values. Without something like rank, if
> > > > userspace needs to fix things up,  it gets hard with device
> > > > hotplugging. ie, the userspace policy could be that any new PMEM tier device
> > > > that is hotplugged, park it with a very low-rank value and hence lowest in
> > > > demotion order by default. (echo 10 >
> > > > /sys/devices/system/memtier/memtier2/rank) . After that userspace could
> > > > selectively move the new devices to the correct memory tier?
> > > 
> > > I had touched on this in the other email.
> > > 
> > > This doesn't work if two drivers that should have separate policies
> > > collide into the same tier - which is very likely with just 3 tiers.
> > > So it seems to me the main usecase for having a rank tunable falls
> > > apart rather quickly until tiers are spaced out more widely. And it
> > > does so at the cost of an, IMO, tricky to understand interface.
> > > 
> > 
> > Considering the kernel has a static map for these tiers, how can two drivers
> > end up using the same tier? If a new driver is going to manage a memory
> > device that is of different characteristics than the one managed by dax/kmem,
> > we will end up adding 
> > 
> > #define MEMORY_TIER_NEW_DEVICE 4
> > 
> > The new driver will never use MEMORY_TIER_PMEM
> > 
> > What can happen is two devices that are managed by DAX/kmem that
> > should be in two memory tiers get assigned the same memory tier
> > because the dax/kmem driver added both devices to the same memory tier.
> > 
> > In the future we would avoid that by using more device properties like HMAT
> > to create additional memory tiers with different rank values. ie, we would
> > do in the dax/kmem create_tier_from_rank() .
> 
> Yes, that's the type of collision I mean. Two GPUs, two CXL-attached
> DRAMs of different speeds etc.
> 
> I also like Huang's idea of using latency characteristics instead of
> abstract distances. Though I'm not quite sure how feasible this is in
> the short term, and share some concerns that Jonathan raised. But I
> think a wider possible range to begin with makes sense in any case.
> 
> > > In the other email I had suggested the ability to override not just
> > > the per-device distance, but also the driver default for new devices
> > > to handle the hotplug situation.
> > > 
> > 
> > I understand that the driver override will be done via module parameters.
> > How will we implement device override? For example in case of dax/kmem driver
> > the device override will be per dax device? What interface will we use to set the override?
> > 
> > IIUC in the above proposal the dax/kmem will do
> > 
> > node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax));
> > 
> > get_device_tier_index(struct dev_dax *dev)
> > {
> >     return dax_kmem_tier_index; // module parameter
> > }
> > 
> > Are you suggesting to add a dev_dax property to override the tier defaults?
> 
> I was thinking a new struct memdevice and struct memtype(?). Every
> driver implementing memory devices like this sets those up and
> registers them with generic code and preset parameters. The generic
> code creates sysfs directories and allows overriding the parameters.
> 
> struct memdevice {
> 	struct device dev;
> 	unsigned long distance;
> 	struct list_head siblings;
> 	/* nid? ... */
> };
> 
> struct memtype {
> 	struct device_type type;
> 	unsigned long default_distance;
> 	struct list_head devices;
> };
> 
> That forms the (tweakable) tree describing physical properties.

In general, I think memtype is a good idea.  I have suggested
something similar before.  It can describe the characteristics of a
specific type of memory (the same memory media with a different
interface (e.g., CXL or DIMM) will be a different memory type).  And
they can be used to provide overriding information.

As for memdevice, I think that we already have "node" to represent
them in sysfs.  Do we really need another one?  Is it sufficient to
add some links to the node in the appropriate directory?  For example,
make the memtype a class device under the physical device (e.g. the
CXL device), and create links to the node inside the memtype class
device directory?

> From that, the kernel then generates the ordered list of tiers.

As Jonathan Cameron pointed out, we may need the memory tier ID to be
stable if possible.  I know this isn't an easy task.  At least we can
make the default memory tier (CPU-local DRAM) ID stable (for example,
make it always 128)?  That provides an anchor for users to understand.

Best Regards,
Huang, Ying

> > > This should be less policy than before. Driver default and per-device
> > > distances (both overridable) combined with one tunable to set the
> > > range of distances that get grouped into tiers.
> > > 
> > 
> > Can you elaborate more on how distance value will be used? The device/device NUMA node can have
> > different distance value from other NUMA nodes. How do we group them?
> > for ex: earlier discussion did outline three different topologies. Can you
> > elaborate on how we would end up grouping them using distance?
> > 
> > For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Node1,
> > so how will we classify node 2?
> > 
> > 
> > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> > 
> > 		  20
> >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> >        |        \   /       |
> >        | 30    40 X 40      | 30
> >        |        /   \       |
> >   Node 2 (PMEM)  ----  Node 3 (PMEM)
> > 		  40
> > 
> > node distances:
> > node   0    1    2    3
> >    0  10   20   30   40
> >    1  20   10   40   30
> >    2  30   40   10   40
> >    3  40   30   40   10
> 
> I'm fairly confused by this example. Do all nodes have CPUs? Isn't
> this just classic NUMA, where optimizing for locality makes the most
> sense, rather than tiering?
> 
> Forget the interface for a second, I have no idea how tiering on such
> a system would work. One CPU's lower tier can be another CPU's
> toptier. There is no lowest rung from which to actually *reclaim*
> pages. Would the CPUs just demote in circles?
> 
> And the coldest pages on one socket would get demoted into another
> socket and displace what that socket considers hot local memory?
> 
> > I feel like I'm missing something.
> 
> When we're talking about tiered memory, I'm thinking about CPUs
> utilizing more than one memory node. If those other nodes have CPUs,
> you can't reliably establish a singular tier order anymore and it
> becomes classic NUMA, no?



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-16  1:11                               ` Ying Huang
@ 2022-06-16  3:45                                 ` Wei Xu
  2022-06-16  4:47                                   ` Aneesh Kumar K V
  2022-06-17 10:41                                 ` Jonathan Cameron
  1 sibling, 1 reply; 84+ messages in thread
From: Wei Xu @ 2022-06-16  3:45 UTC (permalink / raw)
  To: Ying Huang
  Cc: Johannes Weiner, Aneesh Kumar K V, Jonathan Cameron, Linux MM,
	Andrew Morton, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

On Wed, Jun 15, 2022 at 6:11 PM Ying Huang <ying.huang@intel.com> wrote:
>
> On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote:
> > On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:
> > > On 6/13/22 9:20 PM, Johannes Weiner wrote:
> > > > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:
> > > > > If the kernel still can't make the right decision, userspace could rearrange
> > > > > them in any order using rank values. Without something like rank, if
> > > > > userspace needs to fix things up,  it gets hard with device
> > > > > hotplugging. ie, the userspace policy could be that any new PMEM tier device
> > > > > that is hotplugged, park it with a very low-rank value and hence lowest in
> > > > > demotion order by default. (echo 10 >
> > > > > /sys/devices/system/memtier/memtier2/rank) . After that userspace could
> > > > > selectively move the new devices to the correct memory tier?
> > > >
> > > > I had touched on this in the other email.
> > > >
> > > > This doesn't work if two drivers that should have separate policies
> > > > collide into the same tier - which is very likely with just 3 tiers.
> > > > So it seems to me the main usecase for having a rank tunable falls
> > > > apart rather quickly until tiers are spaced out more widely. And it
> > > > does so at the cost of an, IMO, tricky to understand interface.
> > > >
> > >
> > > Considering the kernel has a static map for these tiers, how can two drivers
> > > end up using the same tier? If a new driver is going to manage a memory
> > > device that is of different characteristics than the one managed by dax/kmem,
> > > we will end up adding
> > >
> > > #define MEMORY_TIER_NEW_DEVICE 4
> > >
> > > The new driver will never use MEMORY_TIER_PMEM
> > >
> > > What can happen is that two devices managed by DAX/kmem that
> > > should be in two memory tiers get assigned the same memory tier
> > > because the dax/kmem driver added both devices to the same memory tier.
> > >
> > > In the future we would avoid that by using more device properties like HMAT
> > > to create additional memory tiers with different rank values, i.e., we would
> > > call create_tier_from_rank() in dax/kmem.
> >
> > Yes, that's the type of collision I mean. Two GPUs, two CXL-attached
> > DRAMs of different speeds etc.
> >
> > I also like Huang's idea of using latency characteristics instead of
> > abstract distances. Though I'm not quite sure how feasible this is in
> > the short term, and share some concerns that Jonathan raised. But I
> > think a wider possible range to begin with makes sense in any case.
> >
> > > > In the other email I had suggested the ability to override not just
> > > > the per-device distance, but also the driver default for new devices
> > > > to handle the hotplug situation.
> > > >
> > >
> > > I understand that the driver override will be done via module parameters.
> > > How will we implement the device override? For example, in the case of the dax/kmem driver,
> > > will the device override be per dax device? What interface will we use to set the override?
> > >
> > > IIUC in the above proposal the dax/kmem will do
> > >
> > > node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax));
> > >
> > > get_device_tier_index(struct dev_dax *dev)
> > > {
> > >     return dax_kmem_tier_index; // module parameter
> > > }
> > >
> > > Are you suggesting adding a dev_dax property to override the tier defaults?
> >
> > I was thinking of a new struct memdevice and struct memtype(?). Every
> > driver implementing memory devices like this sets those up and
> > registers them with generic code and preset parameters. The generic
> > code creates sysfs directories and allows overriding the parameters.
> >
> > struct memdevice {
> >       struct device dev;
> >       unsigned long distance;
> >       struct list_head siblings;
> >       /* nid? ... */
> > };
> >
> > struct memtype {
> >       struct device_type type;
> >       unsigned long default_distance;
> >       struct list_head devices;
> > };
> >
> > That forms the (tweakable) tree describing physical properties.
>
> In general, I think memtype is a good idea.  I have suggested
> something similar before.  It can describe the characteristics of a
> specific type of memory (the same memory media with different interfaces
> (e.g., CXL or DIMM) will be different memory types).  And they can
> be used to provide overriding information.
>
> As for memdevice, I think that we already have "node" to represent
> them in sysfs.  Do we really need another one?  Is it sufficient to
> add some links to node in the appropriate directory?  For example,
> make memtype class device under the physical device (e.g. CXL device),
> and create links to node inside the memtype class device directory?
>
> > From that, the kernel then generates the ordered list of tiers.
>
> As Jonathan Cameron pointed out, we may need the memory tier ID to be
> stable if possible.  I know this isn't an easy task.  At least we can
> make the default memory tier (CPU local DRAM) ID stable (for example
> make it always 128)?  That provides an anchor for users to understand.

One of the motivations of introducing "rank" is to allow memory tier
ID to be stable, at least for the well-defined tiers such as the
default memory tier.  The default memory tier can be moved around in
the tier hierarchy by adjusting its rank position relative to other
tiers, but its device ID can remain the same, e.g. always 1.
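
A minimal sketch of that decoupling, with hypothetical field names (the
posted series may differ):

#include <linux/list.h>
#include <linux/nodemask.h>

/*
 * Hypothetical layout: id is allocated once and never reused, so
 * /sys/devices/system/memtier/memtier<id> stays stable, while the
 * global tier list is kept sorted by rank.  Writing a new rank via
 * sysfs re-sorts the demotion order without touching any id.
 */
struct memory_tier {
        int id;                 /* stable identity, e.g. always 1 */
        int rank;               /* mutable policy knob, higher = faster */
        struct list_head list;  /* sorted by rank, not by id */
        nodemask_t nodelist;    /* NUMA nodes in this tier */
};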

> Best Regards,
> Huang, Ying
>
> > > > This should be less policy than before. Driver default and per-device
> > > > distances (both overridable) combined with one tunable to set the
> > > > range of distances that get grouped into tiers.
> > > >
> > >
> > > Can you elaborate more on how the distance value will be used? The device/device NUMA node can have
> > > different distance values from other NUMA nodes. How do we group them?
> > > for ex: earlier discussion did outline three different topologies. Can you
> > > elaborate how we would end up grouping them using distance?
> > >
> > > For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Node1,
> > > so how will we classify node 2?
> > >
> > >
> > > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> > >
> > >               20
> > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > >        |        \   /       |
> > >        | 30    40 X 40      | 30
> > >        |        /   \       |
> > >   Node 2 (PMEM)  ----  Node 3 (PMEM)
> > >               40
> > >
> > > node distances:
> > > node   0    1    2    3
> > >    0  10   20   30   40
> > >    1  20   10   40   30
> > >    2  30   40   10   40
> > >    3  40   30   40   10
> >
> > I'm fairly confused by this example. Do all nodes have CPUs? Isn't
> > this just classic NUMA, where optimizing for locality makes the most
> > sense, rather than tiering?
> >
> > Forget the interface for a second, I have no idea how tiering on such
> > a system would work. One CPU's lower tier can be another CPU's
> > toptier. There is no lowest rung from which to actually *reclaim*
> > pages. Would the CPUs just demote in circles?
> >
> > And the coldest pages on one socket would get demoted into another
> > socket and displace what that socket considers hot local memory?
> >
> > I feel like I'm missing something.
> >
> > When we're talking about tiered memory, I'm thinking about CPUs
> > utilizing more than one memory node. If those other nodes have CPUs,
> > you can't reliably establish a singular tier order anymore and it
> > becomes classic NUMA, no?
>
>
>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-16  3:45                                 ` Wei Xu
@ 2022-06-16  4:47                                   ` Aneesh Kumar K V
  2022-06-16  5:51                                     ` Ying Huang
  0 siblings, 1 reply; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-16  4:47 UTC (permalink / raw)
  To: Wei Xu, Ying Huang
  Cc: Johannes Weiner, Jonathan Cameron, Linux MM, Andrew Morton,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/16/22 9:15 AM, Wei Xu wrote:
> On Wed, Jun 15, 2022 at 6:11 PM Ying Huang <ying.huang@intel.com> wrote:
>>
>> On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote:
>>> On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:

....

>> As Jonathan Cameron pointed out, we may need the memory tier ID to be
>> stable if possible.  I know this isn't an easy task.  At least we can
>> make the default memory tier (CPU local DRAM) ID stable (for example
>> make it always 128)?  That provides an anchor for users to understand.
> 
> One of the motivations of introducing "rank" is to allow memory tier
> ID to be stable, at least for the well-defined tiers such as the
> default memory tier.  The default memory tier can be moved around in
> the tier hierarchy by adjusting its rank position relative to other
> tiers, but its device ID can remain the same, e.g. always 1.
> 

With /sys/devices/system/memtier/default_tier userspace will be able to query
the default tier details. Did you get to look at 

https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com

Any reason why that will not work with all the requirements we had?

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-16  4:47                                   ` Aneesh Kumar K V
@ 2022-06-16  5:51                                     ` Ying Huang
  0 siblings, 0 replies; 84+ messages in thread
From: Ying Huang @ 2022-06-16  5:51 UTC (permalink / raw)
  To: Aneesh Kumar K V, Wei Xu
  Cc: Johannes Weiner, Jonathan Cameron, Linux MM, Andrew Morton,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Thu, 2022-06-16 at 10:17 +0530, Aneesh Kumar K V wrote:
> On 6/16/22 9:15 AM, Wei Xu wrote:
> > On Wed, Jun 15, 2022 at 6:11 PM Ying Huang <ying.huang@intel.com> wrote:
> > > 
> > > On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote:
> > > > On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:
> 
> ....
> 
> > > As Jonathan Cameron pointed out, we may need the memory tier ID to be
> > > stable if possible.  I know this isn't an easy task.  At least we can
> > > make the default memory tier (CPU local DRAM) ID stable (for example
> > > make it always 128)?  That provides an anchor for users to understand.
> > 
> > One of the motivations of introducing "rank" is to allow memory tier
> > ID to be stable, at least for the well-defined tiers such as the
> > default memory tier.  The default memory tier can be moved around in
> > the tier hierarchy by adjusting its rank position relative to other
> > tiers, but its device ID can remain the same, e.g. always 1.
> > 
> 
> With /sys/devices/system/memtier/default_tier userspace will be able to query
> the default tier details.
> 

Yes.  This is a way to address the memory tier ID stability issue too. 
Another choice is to make default_tier a symbolic link.
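
A minimal sketch of that alternative - the helper and its caller are
hypothetical, while sysfs_create_link()/sysfs_remove_link() are the
stock sysfs primitives:

#include <linux/kobject.h>
#include <linux/sysfs.h>

/* Hypothetical: re-point the stable "default_tier" name whenever the
 * tier holding CPU-local DRAM changes identity. */
static int memtier_update_default_link(struct kobject *memtier_root,
                                       struct kobject *new_default)
{
        sysfs_remove_link(memtier_root, "default_tier");
        return sysfs_create_link(memtier_root, new_default, "default_tier");
}

Userspace then always opens /sys/devices/system/memtier/default_tier,
no matter which memtierN it currently resolves to.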


Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-16  1:11                               ` Ying Huang
  2022-06-16  3:45                                 ` Wei Xu
@ 2022-06-17 10:41                                 ` Jonathan Cameron
  2022-06-20  1:54                                   ` Huang, Ying
  1 sibling, 1 reply; 84+ messages in thread
From: Jonathan Cameron @ 2022-06-17 10:41 UTC (permalink / raw)
  To: Ying Huang
  Cc: Johannes Weiner, Aneesh Kumar K V, linux-mm, akpm, Wei Xu,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Thu, 16 Jun 2022 09:11:24 +0800
Ying Huang <ying.huang@intel.com> wrote:

> On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote:
> > On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:  
> > > On 6/13/22 9:20 PM, Johannes Weiner wrote:  
> > > > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:  
> > > > > If the kernel still can't make the right decision, userspace could rearrange
> > > > > them in any order using rank values. Without something like rank, if
> > > > > userspace needs to fix things up, it gets hard with device
> > > > > hotplugging, i.e., the userspace policy could be to park any newly hotplugged
> > > > > PMEM tier device with a very low rank value, and hence lowest in the
> > > > > demotion order by default (echo 10 >
> > > > > /sys/devices/system/memtier/memtier2/rank). After that, userspace could
> > > > > selectively move the new devices to the correct memory tier.
> > > > 
> > > > I had touched on this in the other email.
> > > > 
> > > > This doesn't work if two drivers that should have separate policies
> > > > collide into the same tier - which is very likely with just 3 tiers.
> > > > So it seems to me the main usecase for having a rank tunable falls
> > > > apart rather quickly until tiers are spaced out more widely. And it
> > > > does so at the cost of an, IMO, tricky to understand interface.
> > > >   
> > > 
> > > Considering the kernel has a static map for these tiers, how can two drivers
> > > end up using the same tier? If a new driver is going to manage a memory
> > > device that is of different characteristics than the one managed by dax/kmem,
> > > we will end up adding 
> > > 
> > > #define MEMORY_TIER_NEW_DEVICE 4
> > > 
> > > The new driver will never use MEMORY_TIER_PMEM
> > > 
> > > What can happen is that two devices managed by DAX/kmem that
> > > should be in two memory tiers get assigned the same memory tier
> > > because the dax/kmem driver added both devices to the same memory tier.
> > >
> > > In the future we would avoid that by using more device properties like HMAT
> > > to create additional memory tiers with different rank values, i.e., we would
> > > call create_tier_from_rank() in dax/kmem.
> > 
> > Yes, that's the type of collision I mean. Two GPUs, two CXL-attached
> > DRAMs of different speeds etc.
> > 
> > I also like Huang's idea of using latency characteristics instead of
> > abstract distances. Though I'm not quite sure how feasible this is in
> > the short term, and share some concerns that Jonathan raised. But I
> > think a wider possible range to begin with makes sense in any case.
> >   
> > > > In the other email I had suggested the ability to override not just
> > > > the per-device distance, but also the driver default for new devices
> > > > to handle the hotplug situation.
> > > >   
> > > 
> > > I understand that the driver override will be done via module parameters.
> > > How will we implement the device override? For example, in the case of the dax/kmem driver,
> > > will the device override be per dax device? What interface will we use to set the override?
> > > 
> > > IIUC in the above proposal the dax/kmem will do
> > > 
> > > node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax));
> > > 
> > > get_device_tier_index(struct dev_dax *dev)
> > > {
> > >     return dax_kmem_tier_index; // module parameter
> > > }
> > > 
> > > Are you suggesting adding a dev_dax property to override the tier defaults?
> > 
> > I was thinking of a new struct memdevice and struct memtype(?). Every
> > driver implementing memory devices like this sets those up and
> > registers them with generic code and preset parameters. The generic
> > code creates sysfs directories and allows overriding the parameters.
> > 
> > struct memdevice {
> > 	struct device dev;
> > 	unsigned long distance;
> > 	struct list_head siblings;
> > 	/* nid? ... */
> > };
> > 
> > struct memtype {
> > 	struct device_type type;
> > 	unsigned long default_distance;
> > 	struct list_head devices;
> > };
> > 
> > That forms the (tweakable) tree describing physical properties.  
> 
> In general, I think memtype is a good idea.  I have suggested
> something similar before.  It can describe the characteristics of a
> specific type of memory (the same memory media with different interfaces
> (e.g., CXL or DIMM) will be different memory types).  And they can
> be used to provide overriding information.
I'm not sure whether you are suggesting interface as one element for distinguishing
types, or as *the* element - just in case it's 'the element'.
Ignore the next bit if not ;)

Memory "interface" isn't going to be enough of a distinction.  If you want to have
a default distance it would need to be different for cases where the
same 'type' of RAM has very different characteristics. This applies everywhere,
but given that CXL 'defines' a lot of this - consider just DRAM attached
via CXL:

1. 16-lane direct attached DRAM device.  (low latency - high bw)
2. 4x 16-lane direct attached DRAM interleaved (low latency - very high bw)
3. 4-lane direct attached DRAM device (low latency - low bandwidth)
4. 16-lane to single switch, 4x 4-lane devices interleaved (mid latency - high bw)
5. 4-lane to single switch, 4x 4-lane devices interleaved (mid latency, mid bw)
6. 4x 16-lane to 4 switches, each switch to 4 DRAM devices (mid latency, very high bw)
(7. 16-lane direct attached NVRAM (mid-ish latency, high bw - perf-wise might be
    similar-ish to 4).)

It could be a lot more complex, but hopefully that conveys that 'type'
is next to useless to characterize things unless we have a very large number
of potential subtypes. If we were on the current tiering proposal,
we'd just have the CXL subsystem manage multiple tiers to cover what is
attached.
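
To make that concrete, a purely illustrative sketch (no such helper
exists in the posted series): a default distance derived from
per-device latency/bandwidth separates cases 1-7 above automatically,
which a single per-'type' constant cannot.

#include <linux/minmax.h>

/*
 * Hypothetical: turn HMAT/CDAT-style numbers into an abstract
 * distance, so two CXL DRAM devices behind different fabric depths
 * land at different distances despite sharing one 'type'.  The
 * weighting constants are arbitrary placeholders.
 */
static unsigned int cxl_default_distance(unsigned int read_lat_ns,
                                         unsigned int bandwidth_mb_s)
{
        return read_lat_ns + 1000000u / max(bandwidth_mb_s, 1u);
}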

> 
> As for memdevice, I think that we already have "node" to represent
> them in sysfs.  Do we really need another one?  Is it sufficient to
> add some links to node in the appropriate directory?  For example,
> make memtype class device under the physical device (e.g. CXL device),
> and create links to node inside the memtype class device directory?
> 
> > From that, the kernel then generates the ordered list of tiers.  
> 
> As Jonathan Cameron pointed out, we may need the memory tier ID to be
> stable if possible.  I know this isn't an easy task.  At least we can
> make the default memory tier (CPU local DRAM) ID stable (for example
> make it always 128)?  That provides an anchor for users to understand.
> 
> Best Regards,
> Huang, Ying
> 
> > > > This should be less policy than before. Driver default and per-device
> > > > distances (both overridable) combined with one tunable to set the
> > > > range of distances that get grouped into tiers.
> > > >   
> > > 
> > > Can you elaborate more on how the distance value will be used? The device/device NUMA node can have
> > > different distance values from other NUMA nodes. How do we group them?
> > > for ex: earlier discussion did outline three different topologies. Can you
> > > elaborate how we would end up grouping them using distance?
> > > 
> > > For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Node1,
> > > so how will we classify node 2?
> > > 
> > > 
> > > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> > > 
> > > 		  20
> > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > >        |        \   /       |
> > >        | 30    40 X 40      | 30
> > >        |        /   \       |
> > >   Node 2 (PMEM)  ----  Node 3 (PMEM)
> > > 		  40
> > > 
> > > node distances:
> > > node   0    1    2    3
> > >    0  10   20   30   40
> > >    1  20   10   40   30
> > >    2  30   40   10   40
> > >    3  40   30   40   10  
> > 
> > I'm fairly confused by this example. Do all nodes have CPUs? Isn't
> > this just classic NUMA, where optimizing for locality makes the most
> > sense, rather than tiering?
> > 
> > Forget the interface for a second, I have no idea how tiering on such
> > a system would work. One CPU's lower tier can be another CPU's
> > toptier. There is no lowest rung from which to actually *reclaim*
> > pages. Would the CPUs just demote in circles?
> > 
> > And the coldest pages on one socket would get demoted into another
> > socket and displace what that socket considers hot local memory?
> > 
> > I feel like I'm missing something.
> > 
> > When we're talking about tiered memory, I'm thinking about CPUs
> > utilizing more than one memory node. If those other nodes have CPUs,
> > you can't reliably establish a singular tier order anymore and it
> > becomes classic NUMA, no?  
> 
> 


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-17 10:41                                 ` Jonathan Cameron
@ 2022-06-20  1:54                                   ` Huang, Ying
  0 siblings, 0 replies; 84+ messages in thread
From: Huang, Ying @ 2022-06-20  1:54 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Johannes Weiner, Aneesh Kumar K V, linux-mm, akpm, Wei Xu,
	Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes


Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes:

> On Thu, 16 Jun 2022 09:11:24 +0800
> Ying Huang <ying.huang@intel.com> wrote:
>
>> On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote:
>> > On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:  
>> > > On 6/13/22 9:20 PM, Johannes Weiner wrote:  
>> > > > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:  
>> > > > > If the kernel still can't make the right decision, userspace could rearrange
>> > > > > them in any order using rank values. Without something like rank, if
>> > > > > userspace needs to fix things up, it gets hard with device
>> > > > > hotplugging, i.e., the userspace policy could be to park any newly hotplugged
>> > > > > PMEM tier device with a very low rank value, and hence lowest in the
>> > > > > demotion order by default (echo 10 >
>> > > > > /sys/devices/system/memtier/memtier2/rank). After that, userspace could
>> > > > > selectively move the new devices to the correct memory tier.
>> > > > 
>> > > > I had touched on this in the other email.
>> > > > 
>> > > > This doesn't work if two drivers that should have separate policies
>> > > > collide into the same tier - which is very likely with just 3 tiers.
>> > > > So it seems to me the main usecase for having a rank tunable falls
>> > > > apart rather quickly until tiers are spaced out more widely. And it
>> > > > does so at the cost of an, IMO, tricky to understand interface.
>> > > >   
>> > > 
>> > > Considering the kernel has a static map for these tiers, how can two drivers
>> > > end up using the same tier? If a new driver is going to manage a memory
>> > > device that is of different characteristics than the one managed by dax/kmem,
>> > > we will end up adding 
>> > > 
>> > > #define MEMORY_TIER_NEW_DEVICE 4
>> > > 
>> > > The new driver will never use MEMORY_TIER_PMEM
>> > > 
>> > > What can happen is that two devices managed by DAX/kmem that
>> > > should be in two memory tiers get assigned the same memory tier
>> > > because the dax/kmem driver added both devices to the same memory tier.
>> > >
>> > > In the future we would avoid that by using more device properties like HMAT
>> > > to create additional memory tiers with different rank values, i.e., we would
>> > > call create_tier_from_rank() in dax/kmem.
>> > 
>> > Yes, that's the type of collision I mean. Two GPUs, two CXL-attached
>> > DRAMs of different speeds etc.
>> > 
>> > I also like Huang's idea of using latency characteristics instead of
>> > abstract distances. Though I'm not quite sure how feasible this is in
>> > the short term, and share some concerns that Jonathan raised. But I
>> > think a wider possible range to begin with makes sense in any case.
>> >   
>> > > > In the other email I had suggested the ability to override not just
>> > > > the per-device distance, but also the driver default for new devices
>> > > > to handle the hotplug situation.
>> > > >   
>> > > 
>> > > I understand that the driver override will be done via module parameters.
>> > > How will we implement the device override? For example, in the case of the dax/kmem driver,
>> > > will the device override be per dax device? What interface will we use to set the override?
>> > > 
>> > > IIUC in the above proposal the dax/kmem will do
>> > > 
>> > > node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax));
>> > > 
>> > > get_device_tier_index(struct dev_dax *dev)
>> > > {
>> > >     return dax_kmem_tier_index; // module parameter
>> > > }
>> > > 
>> > > Are you suggesting adding a dev_dax property to override the tier defaults?
>> > 
>> > I was thinking of a new struct memdevice and struct memtype(?). Every
>> > driver implementing memory devices like this sets those up and
>> > registers them with generic code and preset parameters. The generic
>> > code creates sysfs directories and allows overriding the parameters.
>> > 
>> > struct memdevice {
>> > 	struct device dev;
>> > 	unsigned long distance;
>> > 	struct list_head siblings;
>> > 	/* nid? ... */
>> > };
>> > 
>> > struct memtype {
>> > 	struct device_type type;
>> > 	unsigned long default_distance;
>> > 	struct list_head devices;
>> > };
>> > 
>> > That forms the (tweakable) tree describing physical properties.  
>> 
>> In general, I think memtype is a good idea.  I have suggested
>> something similar before.  It can describe the characteristics of a
>> specific type of memory (the same memory media with different interfaces
>> (e.g., CXL or DIMM) will be different memory types).  And they can
>> be used to provide overriding information.
> I'm not sure whether you are suggesting interface as one element for distinguishing
> types, or as *the* element - just in case it's 'the element'.
> Ignore the next bit if not ;)
>
> Memory "interface" isn't going to be enough of a distinction.  If you want to have
> a default distance it would need to be different for cases where the
> same 'type' of RAM has very different characteristics. This applies everywhere,
> but given that CXL 'defines' a lot of this - consider just DRAM attached
> via CXL:
>
> 1. 16-lane direct attached DRAM device.  (low latency - high bw)
> 2. 4x 16-lane direct attached DRAM interleaved (low latency - very high bw)
> 3. 4-lane direct attached DRAM device (low latency - low bandwidth)
> 4. 16-lane to single switch, 4x 4-lane devices interleaved (mid latency - high bw)
> 5. 4-lane to single switch, 4x 4-lane devices interleaved (mid latency, mid bw)
> 6. 4x 16-lane to 4 switches, each switch to 4 DRAM devices (mid latency, very high bw)
> (7. 16-lane direct attached NVRAM (mid-ish latency, high bw - perf-wise might be
>     similar-ish to 4).)
>
> It could be a lot more complex, but hopefully that conveys that 'type'
> is next to useless to characterize things unless we have a very large number
> of potential subtypes. If we were on the current tiering proposal,
> we'd just have the CXL subsystem manage multiple tiers to cover what is
> attached.

Thanks for the detailed explanation.  I learned a lot from you.  Yes,
the interface itself isn't enough to characterize the memory devices.  We
need a more fine-grained way to do that.  But anyway, I think that it's
better to identify this via the BIOS or kernel instead of from user space.

So the kernel drivers will

- group the enumerated memory devices into memory types

- provide latency/bandwidth or distance information for each memory type

Then user space may determine the policy by adjusting the
latency/bandwidth or distance and/or the tiering granularity.
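
A minimal sketch of that driver-facing split, with hypothetical names
(nothing below is from a posted patch):

#include <linux/list.h>

/*
 * Hypothetical: one record per memory type, filled by the enumerating
 * driver (dax/kmem, CXL, ...) from firmware data; both numbers would
 * be overridable from user space.
 */
struct memory_type_info {
        const char *name;               /* e.g. "cxl-dram-switched" */
        unsigned int read_lat_ns;       /* latency reported by the driver */
        unsigned int bandwidth_mb_s;    /* bandwidth reported by the driver */
        struct list_head nodes;         /* NUMA nodes backed by this type */
};

int register_memory_type(struct memory_type_info *mt);  /* hypothetical */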

Best Regards,
Huang, Ying

>> 
>> As for memdevice, I think that we already have "node" to represent
>> them in sysfs.  Do we really need another one?  Is it sufficient to
>> add some links to node in the appropriate directory?  For example,
>> make memtype class device under the physical device (e.g. CXL device),
>> and create links to node inside the memtype class device directory?
>> 
>> > From that, the kernel then generates the ordered list of tiers.  
>> 
>> As Jonathan Cameron pointed out, we may need the memory tier ID to be
>> stable if possible.  I know this isn't an easy task.  At least we can
>> make the default memory tier (CPU local DRAM) ID stable (for example
>> make it always 128)?  That provides an anchor for users to understand.
>> 
>> Best Regards,
>> Huang, Ying
>> 
>> > > > This should be less policy than before. Driver default and per-device
>> > > > distances (both overridable) combined with one tunable to set the
>> > > > range of distances that get grouped into tiers.
>> > > >   
>> > > 
>> > > Can you elaborate more on how the distance value will be used? The device/device NUMA node can have
>> > > different distance values from other NUMA nodes. How do we group them?
>> > > for ex: earlier discussion did outline three different topologies. Can you
>> > > elaborate how we would end up grouping them using distance?
>> > > 
>> > > For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Node1,
>> > > so how will we classify node 2?
>> > > 
>> > > 
>> > > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>> > > 
>> > > 		  20
>> > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
>> > >        |        \   /       |
>> > >        | 30    40 X 40      | 30
>> > >        |        /   \       |
>> > >   Node 2 (PMEM)  ----  Node 3 (PMEM)
>> > > 		  40
>> > > 
>> > > node distances:
>> > > node   0    1    2    3
>> > >    0  10   20   30   40
>> > >    1  20   10   40   30
>> > >    2  30   40   10   40
>> > >    3  40   30   40   10
>> > 
>> > I'm fairly confused by this example. Do all nodes have CPUs? Isn't
>> > this just classic NUMA, where optimizing for locality makes the most
>> > sense, rather than tiering?
>> > 
>> > Forget the interface for a second, I have no idea how tiering on such
>> > a system would work. One CPU's lower tier can be another CPU's
>> > toptier. There is no lowest rung from which to actually *reclaim*
>> > pages. Would the CPUs just demote in circles?
>> > 
>> > And the coldest pages on one socket would get demoted into another
>> > socket and displace what that socket considers hot local memory?
>> > 
>> > I feel like I'm missing something.
>> > 
>> > When we're talking about tiered memory, I'm thinking about CPUs
>> > utilizing more than one memory node. If those other nodes have CPUs,
>> > you can't reliably establish a singular tier order anymore and it
>> > becomes classic NUMA, no?  
>> 
>> 

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
  2022-06-14 16:45                       ` Jonathan Cameron
@ 2022-06-21  8:27                         ` Aneesh Kumar K V
  0 siblings, 0 replies; 84+ messages in thread
From: Aneesh Kumar K V @ 2022-06-21  8:27 UTC (permalink / raw)
  To: Jonathan Cameron, Johannes Weiner, Tim C Chen
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

On 6/14/22 10:15 PM, Jonathan Cameron wrote:
> 

...

>>
>> It could be simple tier0, tier1, tier2 numbering again, but the
>> numbers now would mean something to the user. A rank tunable is no
>> longer necessary.
> 
> This feels like it might make tier assignments a bit less stable
> and hence run into the question of how to hook up accounting. Not my
> area of expertise, but it was put forward as one of the reasons
> we didn't want hotplug to potentially end up shuffling other tiers
> around.  The desire was for a 'stable' entity.  We can avoid that with
> 'space' between them, but then we sort of still have rank, just in a
> form that makes updating it messy (we'd need to create a new tier to do
> it).
> 
>>

How about we do what is proposed here:

https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com

The cgroup accounting patch posted at
https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com
looks at top-tier accounting per cgroup, and I am not sure what tier ID
stability is expected for top-tier accounting.
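
For reference, a tiny sketch of the 'space between tiers' idea quoted
above; the names and values are hypothetical:

/*
 * Hypothetical sparse defaults: the DRAM value is the stable anchor
 * that tier-based accounting could rely on, and the gaps leave room
 * to slot in a hotplugged device type without renumbering the
 * existing tiers.
 */
#define MEMTIER_HBM_GPU         300
#define MEMTIER_DRAM            200     /* default tier: CPU-local DRAM */
#define MEMTIER_PMEM            100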

-aneesh

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2022-06-21  8:29 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-03 13:42 [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
2022-06-03 13:42 ` [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
2022-06-07 18:43   ` Tim Chen
2022-06-07 20:18     ` Wei Xu
2022-06-08  4:30     ` Aneesh Kumar K V
2022-06-08  6:06       ` Ying Huang
2022-06-08  4:37     ` Aneesh Kumar K V
2022-06-08  6:10       ` Ying Huang
2022-06-08  8:04         ` Aneesh Kumar K V
2022-06-07 21:32   ` Yang Shi
2022-06-08  1:34     ` Ying Huang
2022-06-08 16:37       ` Yang Shi
2022-06-09  6:52         ` Ying Huang
2022-06-08  4:58     ` Aneesh Kumar K V
2022-06-08  6:18       ` Ying Huang
2022-06-08 16:42       ` Yang Shi
2022-06-09  8:17         ` Aneesh Kumar K V
2022-06-09 16:04           ` Yang Shi
2022-06-08 14:11   ` Johannes Weiner
2022-06-08 14:21     ` Aneesh Kumar K V
2022-06-08 15:55     ` Johannes Weiner
2022-06-08 16:13       ` Aneesh Kumar K V
2022-06-08 18:16         ` Johannes Weiner
2022-06-09  2:33           ` Aneesh Kumar K V
2022-06-09 13:55             ` Johannes Weiner
2022-06-09 14:22               ` Jonathan Cameron
2022-06-09 20:41                 ` Johannes Weiner
2022-06-10  6:15                   ` Ying Huang
2022-06-10  9:57                   ` Jonathan Cameron
2022-06-13 14:05                     ` Johannes Weiner
2022-06-13 14:23                       ` Aneesh Kumar K V
2022-06-13 15:50                         ` Johannes Weiner
2022-06-14  6:48                           ` Ying Huang
2022-06-14  8:01                           ` Aneesh Kumar K V
2022-06-14 18:56                             ` Johannes Weiner
2022-06-15  6:23                               ` Aneesh Kumar K V
2022-06-16  1:11                               ` Ying Huang
2022-06-16  3:45                                 ` Wei Xu
2022-06-16  4:47                                   ` Aneesh Kumar K V
2022-06-16  5:51                                     ` Ying Huang
2022-06-17 10:41                                 ` Jonathan Cameron
2022-06-20  1:54                                   ` Huang, Ying
2022-06-14 16:45                       ` Jonathan Cameron
2022-06-21  8:27                         ` Aneesh Kumar K V
2022-06-03 13:42 ` [PATCH v5 2/9] mm/demotion: Expose per node memory tier to sysfs Aneesh Kumar K.V
2022-06-07 20:15   ` Tim Chen
2022-06-08  4:55     ` Aneesh Kumar K V
2022-06-08  6:42       ` Ying Huang
2022-06-08 16:06       ` Tim Chen
2022-06-08 16:15         ` Aneesh Kumar K V
2022-06-03 13:42 ` [PATCH v5 3/9] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
2022-06-06 13:39   ` Bharata B Rao
2022-06-03 13:42 ` [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
2022-06-07 22:51   ` Tim Chen
2022-06-08  5:02     ` Aneesh Kumar K V
2022-06-08  6:52     ` Ying Huang
2022-06-08  6:50   ` Ying Huang
2022-06-08  8:19     ` Aneesh Kumar K V
2022-06-08  8:00   ` Ying Huang
2022-06-03 13:42 ` [PATCH v5 5/9] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
2022-06-03 13:42 ` [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers Aneesh Kumar K.V
2022-06-07 23:40   ` Tim Chen
2022-06-08  6:59   ` Ying Huang
2022-06-08  8:20     ` Aneesh Kumar K V
2022-06-08  8:23       ` Ying Huang
2022-06-08  8:29         ` Aneesh Kumar K V
2022-06-08  8:34           ` Ying Huang
2022-06-03 13:42 ` [PATCH v5 7/9] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
2022-06-03 13:42 ` [PATCH v5 8/9] mm/demotion: Add documentation for memory tiering Aneesh Kumar K.V
2022-06-03 13:42 ` [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
2022-06-06  3:11   ` Ying Huang
2022-06-06  3:52     ` Aneesh Kumar K V
2022-06-06  7:24       ` Ying Huang
2022-06-06  8:33         ` Aneesh Kumar K V
2022-06-08  7:26           ` Ying Huang
2022-06-08  8:28             ` Aneesh Kumar K V
2022-06-08  8:32               ` Ying Huang
2022-06-08 14:37                 ` Aneesh Kumar K.V
2022-06-08 20:14                   ` Tim Chen
2022-06-10  6:04                   ` Ying Huang
2022-06-06  4:53 ` [PATCH] mm/demotion: Add sysfs ABI documentation Aneesh Kumar K.V
2022-06-08 13:57 ` [PATCH v5 0/9] mm/demotion: Memory tiers and demotion Johannes Weiner
2022-06-08 14:20   ` Aneesh Kumar K V
2022-06-09  8:53     ` Jonathan Cameron
