linux-kernel.vger.kernel.org archive mirror
* [PATCH v10 0/8] mm/demotion: Memory tiers and demotion
@ 2022-07-20  2:59 Aneesh Kumar K.V
  2022-07-20  2:59 ` [PATCH v10 1/8] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
                   ` (7 more replies)
  0 siblings, 8 replies; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-20  2:59 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

The current kernel has basic memory tiering support: Inactive pages on a
higher tier NUMA node can be migrated (demoted) to a lower tier NUMA node to
make room for new allocations on the higher tier NUMA node. Frequently accessed
pages on a lower tier NUMA node can be migrated (promoted) to a higher tier NUMA
node to improve the performance.

In the current kernel, memory tiers are defined implicitly via a demotion path
relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed. The
current implementation puts all nodes with CPUs into the top tier, and builds the
tier hierarchy tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.

This current memory tier kernel interface needs to be improved for several
important use cases:

* The current tier initialization code always initializes each memory-only NUMA
  node into a lower tier. But a memory-only NUMA node may have a high
  performance memory device (e.g. a DRAM device attached via CXL.mem or a
  DRAM-backed memory-only node on a virtual machine) and should be put into a
  higher tier.

* The current tier hierarchy always puts CPU nodes into the top tier. But on a
  system with HBM (e.g. GPU memory) devices, these memory-only HBM NUMA nodes
  should be in the top tier, and DRAM nodes with CPUs are better placed in the
  next lower tier.

* Also because the current tier hierarchy always puts CPU nodes into the top
  tier, when a CPU is hot-added (or hot-removed) and triggers a memory node from
  CPU-less into a CPU node (or vice versa), the memory tier hierarchy gets
  changed, even though no memory node is added or removed. This can make the
  tier hierarchy unstable and make it difficult to support tier-based memory
  accounting.

* A higher tier node can only be demoted to selected nodes on the next lower
  tier as defined by the demotion path, not any other node from any lower tier.
  This strict, hard-coded demotion order does not work in all use cases (e.g.
  some use cases may want to allow cross-socket demotion to another node in the
  same demotion tier as a fallback when the preferred demotion node is out of
  space), and has resulted in the feature request for an interface to override
  the system-wide, per-node demotion order from the userspace. This demotion
  order is also inconsistent with the page allocation fallback order when all
  the nodes in a higher tier are out of space: The page allocation can fall back
  to any node from any lower tier, whereas the demotion order doesn't allow
  that.
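
The last point above can be illustrated with a small userspace sketch. It is
not code from this series: the struct and the helper name are hypothetical,
and node sets are plain bitmasks rather than nodemask_t. It only shows what
"demotion may fall back to any node from any lower tier" would mean for the
set of allowed targets.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified tier: a perf level and a bitmask of NUMA node ids. */
struct tier {
	unsigned int perf_level;
	unsigned long nodes;	/* bit n set => node n belongs to this tier */
};

/*
 * Return the set of nodes that pages on 'node' may be demoted to: every node
 * in every tier slower than the tier containing 'node'. Returns 0 when the
 * node is in no tier or is already in the slowest tier.
 */
static unsigned long demotion_targets(const struct tier *tiers, size_t ntiers,
				      int node)
{
	unsigned long targets = 0;
	size_t i, j;

	assert(node >= 0 && node < (int)(sizeof(unsigned long) * CHAR_BIT));

	for (i = 0; i < ntiers; i++) {
		if (!(tiers[i].nodes & (1UL << node)))
			continue;
		/* Found the node's tier: collect all strictly slower tiers. */
		for (j = 0; j < ntiers; j++)
			if (tiers[j].perf_level < tiers[i].perf_level)
				targets |= tiers[j].nodes;
		break;
	}
	return targets;
}
```

With the implicit demotion path, only selected nodes of the next lower tier
would be valid targets; here the target set matches what the page allocator
could already fall back to.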

This patch series makes the creation of memory tiers explicit and puts it
under the control of device drivers.

Memory Tier Initialization
==========================

The Linux kernel presents memory devices as NUMA nodes and each memory device is of
a specific type. The memory type of a device is represented by its performance
level. A memory tier corresponds to a range of performance levels. This allows
for classifying memory devices with a specific performance range into a memory
tier.

By default, all memory nodes are assigned to the default tier with
performance level 512.

A device driver can move its memory nodes from the default tier. For example,
PMEM can move its memory nodes below the default tier, whereas GPU can move its
memory nodes above the default tier.

The kernel initialization code makes the decision on which exact tier a memory
node should be assigned to based on the requests from the device drivers as well
as the memory device hardware information provided by the firmware.

Hot-adding or hot-removing CPUs doesn't affect the memory tier hierarchy.
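
As a rough illustration of the rules in this section, the sketch below derives
a node's tier from its performance level alone. The struct, the helper name
and the GPU level of 640 used in the note that follows are made up for this
example; the chunk size of 128 is the value patch 1/8 uses.

```c
#include <assert.h>
#include <stdbool.h>

#define DEFAULT_PERF_LEVEL	512	/* default tier level from the text */
#define TIER_CHUNK		128	/* tier width, as configured in patch 1/8 */

/* Hypothetical per-node description used only for this illustration. */
struct mem_node {
	unsigned int perf_level;	/* 0 means "not set by any driver" */
	bool has_cpu;			/* deliberately ignored below */
};

/* Lowest perf level of the tier this node falls into. */
static unsigned int node_tier(const struct mem_node *n)
{
	unsigned int level;

	assert(n);
	/* Nodes whose driver set nothing land in the default tier. */
	level = n->perf_level ? n->perf_level : DEFAULT_PERF_LEVEL;

	/* Round down to the start of the chunk; has_cpu plays no role. */
	return level - (level % TIER_CHUNK);
}
```

A node with no level set maps to 512, a PMEM node at level 128 maps below it,
and a hypothetical GPU node at level 640 maps above it. Flipping has_cpu
leaves the result unchanged, which is the property that keeps the hierarchy
stable across CPU hotplug.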

Changes from v9:
* Use performance level for initializing memory tiers.

Changes from v8:
* Drop the sysfs interface patches and related documentation changes.

Changes from v7:
* Fix kernel crash with demotion.
* Improve documentation.

Changes from v6:
* Drop the usage of rank.
* Address other review feedback.

Changes from v5:
* Remove patch supporting N_MEMORY node removal from memory tiers. Memory tiers
  are going to be used for features other than demotion. Hence keep all N_MEMORY
  nodes in memory tiers irrespective of whether they want to participate in
  promotion or demotion.
* Add NODE_DATA->memtier
* Rearrange patches to add sysfs files later.
* Add support to create memory tiers from userspace.
* Address other review feedback.


Changes from v4:
* Address review feedback.
* Reverse the meaning of "rank": higher rank value means higher tier.
* Add "/sys/devices/system/memtier/default_tier".
* Add node_is_toptier

v4:
Add support for explicit memory tiers and ranks.

v3:
- Modify patch 1 subject to make it more specific
- Remove /sys/kernel/mm/numa/demotion_targets interface, use
  /sys/devices/system/node/demotion_targets instead and make
  it writable to override node_states[N_DEMOTION_TARGETS].
- Add support to view per node demotion targets via sysfs

v2:
In v1, only the 1st patch of this patch series was sent, which was
implemented to avoid some of the limitations on the demotion
target sharing. However, for certain NUMA topologies, the demotion
targets found by that patch were not optimal, so the 1st patch
in this series is modified according to suggestions from Huang
and Baolin. Different examples of demotion list comparison
between the existing implementation and the changed implementation can
be found in the commit message of the 1st patch.



Aneesh Kumar K.V (7):
  mm/demotion: Add support for explicit memory tiers
  mm/demotion: Move memory demotion related code
  mm/demotion: Add hotplug callbacks to handle new numa node onlined
  mm/demotion/dax/kmem: Set node's performance level to
    MEMTIER_PERF_LEVEL_PMEM
  mm/demotion: Build demotion targets based on explicit memory tiers
  mm/demotion: Add pg_data_t member to track node memory tier details
  mm/demotion: Update node_is_toptier to work with memory tiers

Jagdish Gediya (1):
  mm/demotion: Demote pages according to allocation fallback order

 arch/powerpc/platforms/pseries/papr_scm.c |  41 +-
 drivers/acpi/nfit/core.c                  |  41 +-
 include/linux/memory-tiers.h              |  59 +++
 include/linux/migrate.h                   |  15 -
 include/linux/mmzone.h                    |   3 +
 include/linux/node.h                      |  11 +-
 mm/Makefile                               |   1 +
 mm/huge_memory.c                          |   1 +
 mm/memory-tiers.c                         | 590 ++++++++++++++++++++++
 mm/migrate.c                              | 453 +----------------
 mm/mprotect.c                             |   1 +
 mm/vmscan.c                               |  59 ++-
 mm/vmstat.c                               |   4 -
 13 files changed, 782 insertions(+), 497 deletions(-)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

-- 
2.36.1



* [PATCH v10 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-20  2:59 [PATCH v10 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
@ 2022-07-20  2:59 ` Aneesh Kumar K.V
  2022-07-26  3:53   ` Huang, Ying
  2022-07-20  2:59 ` [PATCH v10 2/8] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-20  2:59 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V, Jagdish Gediya

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created
during the kernel initialization and updated when a NUMA node is
hot-added or hot-removed.  The current implementation puts all
nodes with CPU into the top tier, and builds the tier hierarchy
tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases:

The current tier initialization code always initializes
each memory-only NUMA node into a lower tier.  But a memory-only
NUMA node may have a high performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top
tier. But on a system with HBM or GPU devices, the
memory-only NUMA nodes mapping these devices should be in the
top tier, and DRAM nodes with CPUs are better to be placed into the
next lower tier.

In the current kernel, pages on a higher tier node can only be demoted to
selected nodes on the next lower tier as defined by the demotion path, not
any other node from any lower tier.  This strict, hard-coded demotion order
does not work in all use cases (e.g. some use cases may want to
allow cross-socket demotion to another node in the same demotion
tier as a fallback when the preferred demotion node is out of
space). This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are
out of space: The page allocation can fall back to any node from
any lower tier, whereas the demotion order doesn't allow that.

The current kernel also doesn't provide any interfaces for
userspace to learn about the memory tier hierarchy in order to
optimize its memory allocations.

This patch series addresses the above by defining memory tiers explicitly.

The Linux kernel presents memory devices as NUMA nodes and each memory device is of
a specific type. The memory type of a device is represented by its performance
level. A memory tier corresponds to a range of performance levels. This allows
for classifying memory devices with a specific performance range into a memory
tier.

This patch configures the range/chunk size to be 128. The default DRAM
performance level is 512. We can have 4 memory tiers below the default DRAM
performance level, which cover the ranges 0 - 127, 128 - 255, 256 - 383 and
384 - 511. Slower memory devices like persistent memory will have performance
levels below the default DRAM level and hence will be placed in these 4 lower
tiers.
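
The arithmetic behind these ranges can be checked with a short userspace
sketch. The three macros are copied from include/linux/memory-tiers.h in this
patch; memtier_base() and memtier_top() are hypothetical helpers written for
this illustration, with the mask standing in for the kernel's round_down().

```c
#include <assert.h>

/* Values copied from include/linux/memory-tiers.h in this patch. */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_PERF_LEVEL_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))	/* 512 */
#define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)	/* 128 */

/* Hypothetical helper: lowest perf level of the tier covering perf_level. */
static unsigned int memtier_base(unsigned int perf_level)
{
	unsigned int chunk = 1u << MEMTIER_CHUNK_BITS;

	/* Same result as round_down(perf_level, chunk) for a power-of-two chunk. */
	return perf_level & ~(chunk - 1);
}

/* Hypothetical helper: highest perf level of that same tier. */
static unsigned int memtier_top(unsigned int perf_level)
{
	return memtier_base(perf_level) + (1u << MEMTIER_CHUNK_BITS) - 1;
}
```

MEMTIER_PERF_LEVEL_PMEM falls into the 128 - 255 tier, which leaves the
0 - 127 tier free for devices slower than PMEM, as the comment in the header
intends.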

During reclaim, we migrate pages from fast (higher) tiers to slow (lower) tiers
when the fast (higher) tier is under memory pressure.

A kernel parameter, default_memory_tier_perf_level, is provided to override the
performance level of the default memory tier.

Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  18 +++++++
 include/linux/node.h         |   6 +++
 mm/Makefile                  |   1 +
 mm/memory-tiers.c            | 101 +++++++++++++++++++++++++++++++++++
 4 files changed, 126 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..f28f9910a4e7
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+#ifdef CONFIG_NUMA
+/*
+ * Each tier cover a performance level chunk size of 128
+ */
+#define MEMTIER_CHUNK_BITS	7
+/*
+ * For now let's have 4 memory tier below default DRAM tier.
+ */
+#define MEMTIER_PERF_LEVEL_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
+/* leave one tier below this slow pmem */
+#define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)
+
+#endif	/* CONFIG_NUMA */
+#endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/node.h b/include/linux/node.h
index 40d641a8bfb0..a2a16d4104fd 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -92,6 +92,12 @@ struct node {
 	struct list_head cache_attrs;
 	struct device *cache_dev;
 #endif
+	/*
+	 * For memory devices, perf_level describes
+	 * the device performance and how it should be used
+	 * while building a memory hierarchy.
+	 */
+	int perf_level;
 };
 
 struct memory_block;
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..d30acebc2164 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..61bb84c54091
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/lockdep.h>
+#include <linux/moduleparam.h>
+#include <linux/node.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+	struct list_head list;
+	int perf_level;
+	nodemask_t nodelist;
+};
+
+static LIST_HEAD(memory_tiers);
+static DEFINE_MUTEX(memory_tier_lock);
+
+/*
+ * For now let's have 4 memory tier below default DRAM tier.
+ */
+static unsigned int default_memtier_perf_level = MEMTIER_PERF_LEVEL_DRAM;
+core_param(default_memory_tier_perf_level, default_memtier_perf_level, uint, 0644);
+/*
+ * performance levels are grouped into memtiers each of chunk size
+ * memtier_perf_chunk
+ */
+static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
+{
+	bool found_slot = false;
+	struct list_head *ent;
+	struct memory_tier *memtier, *new_memtier;
+	unsigned int memtier_perf_chunk_size = 1 << MEMTIER_CHUNK_BITS;
+	/*
+	 * zero is special in that it indicates uninitialized
+	 * perf level by respective driver. Pick default memory
+	 * tier perf level for that.
+	 */
+	if (!perf_level)
+		perf_level = default_memtier_perf_level;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	perf_level = round_down(perf_level, memtier_perf_chunk_size);
+	list_for_each(ent, &memory_tiers) {
+		memtier = list_entry(ent, struct memory_tier, list);
+		if (perf_level == memtier->perf_level) {
+			return memtier;
+		} else if (perf_level < memtier->perf_level) {
+			found_slot = true;
+			break;
+		}
+	}
+
+	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!new_memtier)
+		return ERR_PTR(-ENOMEM);
+
+	new_memtier->perf_level = perf_level;
+	if (found_slot)
+		list_add_tail(&new_memtier->list, ent);
+	else
+		list_add_tail(&new_memtier->list, &memory_tiers);
+	return new_memtier;
+}
+
+static int __init memory_tier_init(void)
+{
+	int node;
+	struct memory_tier *memtier;
+
+	/*
+	 * Since this is early during boot, we could avoid
+	 * holding memory_tier_lock. But keep it simple by
+	 * holding the lock, so that we can add lock-held debug
+	 * checks in other functions.
+	 */
+	mutex_lock(&memory_tier_lock);
+	memtier = find_create_memory_tier(default_memtier_perf_level);
+	if (IS_ERR(memtier))
+		panic("%s() failed to register memory tier: %ld\n",
+		      __func__, PTR_ERR(memtier));
+
+	/* CPU only nodes are not part of memory tiers. */
+	memtier->nodelist = node_states[N_MEMORY];
+
+	/*
+	 * Nodes that are already online and that don't
+	 * have a perf level assigned are assigned the default
+	 * perf level.
+	 */
+	for_each_node_state(node, N_MEMORY) {
+		struct node *node_property = node_devices[node];
+
+		if (!node_property->perf_level)
+			node_property->perf_level = default_memtier_perf_level;
+	}
+	mutex_unlock(&memory_tier_lock);
+	return 0;
+}
+subsys_initcall(memory_tier_init);
-- 
2.36.1



* [PATCH v10 2/8] mm/demotion: Move memory demotion related code
  2022-07-20  2:59 [PATCH v10 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
  2022-07-20  2:59 ` [PATCH v10 1/8] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
@ 2022-07-20  2:59 ` Aneesh Kumar K.V
  2022-07-20  2:59 ` [PATCH v10 3/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-20  2:59 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

This moves the memory demotion related code to mm/memory-tiers.c.
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  6 ++++
 include/linux/migrate.h      |  2 --
 mm/memory-tiers.c            | 62 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 | 60 +---------------------------------
 mm/vmscan.c                  |  1 +
 5 files changed, 70 insertions(+), 61 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index f28f9910a4e7..ef380a39db3a 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -2,6 +2,7 @@
 #ifndef _LINUX_MEMORY_TIERS_H
 #define _LINUX_MEMORY_TIERS_H
 
+#include <linux/types.h>
 #ifdef CONFIG_NUMA
 /*
  * Each tier cover a performance level chunk size of 128
@@ -14,5 +15,10 @@
 /* leave one tier below this slow pmem */
 #define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)
 
+extern bool numa_demotion_enabled;
+
+#else
+
+#define numa_demotion_enabled	false
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 069a89e847f3..43e737215f33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -78,7 +78,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
 extern void set_migration_target_nodes(void);
 extern void migrate_on_reclaim_init(void);
-extern bool numa_demotion_enabled;
 extern int next_demotion_node(int node);
 #else
 static inline void set_migration_target_nodes(void) {}
@@ -87,7 +86,6 @@ static inline int next_demotion_node(int node)
 {
         return NUMA_NO_NODE;
 }
-#define numa_demotion_enabled  false
 #endif
 
 #ifdef CONFIG_COMPACTION
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 61bb84c54091..41a21cc5ae55 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -99,3 +99,65 @@ static int __init memory_tier_init(void)
 	return 0;
 }
 subsys_initcall(memory_tier_init);
+
+bool numa_demotion_enabled = false;
+
+#ifdef CONFIG_MIGRATION
+#ifdef CONFIG_SYSFS
+static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  numa_demotion_enabled ? "true" : "false");
+}
+
+static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = kstrtobool(buf, &numa_demotion_enabled);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute numa_demotion_enabled_attr =
+	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
+	       numa_demotion_enabled_store);
+
+static struct attribute *numa_attrs[] = {
+	&numa_demotion_enabled_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group numa_attr_group = {
+	.attrs = numa_attrs,
+};
+
+static int __init numa_init_sysfs(void)
+{
+	int err;
+	struct kobject *numa_kobj;
+
+	numa_kobj = kobject_create_and_add("numa", mm_kobj);
+	if (!numa_kobj) {
+		pr_err("failed to create numa kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(numa_kobj, &numa_attr_group);
+	if (err) {
+		pr_err("failed to register numa group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(numa_kobj);
+	return err;
+}
+subsys_initcall(numa_init_sysfs);
+#endif /* CONFIG_SYSFS */
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c1ea61f39d8..fce7d4a9e940 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2509,64 +2509,6 @@ void __init migrate_on_reclaim_init(void)
 	set_migration_target_nodes();
 	cpus_read_unlock();
 }
+#endif /* CONFIG_NUMA */
 
-bool numa_demotion_enabled = false;
-
-#ifdef CONFIG_SYSFS
-static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
-					  struct kobj_attribute *attr, char *buf)
-{
-	return sysfs_emit(buf, "%s\n",
-			  numa_demotion_enabled ? "true" : "false");
-}
-
-static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
-					   struct kobj_attribute *attr,
-					   const char *buf, size_t count)
-{
-	ssize_t ret;
-
-	ret = kstrtobool(buf, &numa_demotion_enabled);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static struct kobj_attribute numa_demotion_enabled_attr =
-	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
-	       numa_demotion_enabled_store);
-
-static struct attribute *numa_attrs[] = {
-	&numa_demotion_enabled_attr.attr,
-	NULL,
-};
-
-static const struct attribute_group numa_attr_group = {
-	.attrs = numa_attrs,
-};
-
-static int __init numa_init_sysfs(void)
-{
-	int err;
-	struct kobject *numa_kobj;
 
-	numa_kobj = kobject_create_and_add("numa", mm_kobj);
-	if (!numa_kobj) {
-		pr_err("failed to create numa kobject\n");
-		return -ENOMEM;
-	}
-	err = sysfs_create_group(numa_kobj, &numa_attr_group);
-	if (err) {
-		pr_err("failed to register numa group\n");
-		goto delete_obj;
-	}
-	return 0;
-
-delete_obj:
-	kobject_put(numa_kobj);
-	return err;
-}
-subsys_initcall(numa_init_sysfs);
-#endif /* CONFIG_SYSFS */
-#endif /* CONFIG_NUMA */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7d9a683e3a7..3a8f78277f99 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,7 @@
 #include <linux/printk.h>
 #include <linux/dax.h>
 #include <linux/psi.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
-- 
2.36.1



* [PATCH v10 3/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined
  2022-07-20  2:59 [PATCH v10 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
  2022-07-20  2:59 ` [PATCH v10 1/8] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
  2022-07-20  2:59 ` [PATCH v10 2/8] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
@ 2022-07-20  2:59 ` Aneesh Kumar K.V
  2022-07-26  4:03   ` Huang, Ying
  2022-07-20  2:59 ` [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-20  2:59 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

If a newly onlined NUMA node doesn't have a performance level assigned,
the kernel adds the NUMA node to the default memory tier.
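
The online/offline handling can be sketched in userspace as follows. This is
a simplified model, not the code in this patch: tiers live in a fixed-size,
unsorted array, node sets are bitmasks, all names are hypothetical, and the
lookup result is checked before use. The constants are the values from
patch 1/8.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_TIERS		8
#define TIER_CHUNK		128	/* chunk size from patch 1/8 */
#define DEFAULT_PERF_LEVEL	512	/* default DRAM level from patch 1/8 */

struct tier {
	unsigned int perf_level;	/* rounded-down base of the tier */
	unsigned long nodes;		/* bitmask of member node ids */
};

static struct tier tiers[MAX_TIERS];
static int ntiers;

/* Return the tier currently holding 'node', or NULL if it has none. */
static struct tier *node_get_tier(int node)
{
	for (int i = 0; i < ntiers; i++)
		if (tiers[i].nodes & (1UL << node))
			return &tiers[i];
	return NULL;
}

/* Find the tier for perf_level, creating it if needed; NULL if the table is full. */
static struct tier *find_create_tier(unsigned int perf_level)
{
	if (!perf_level)		/* 0: no driver set a level */
		perf_level = DEFAULT_PERF_LEVEL;
	perf_level -= perf_level % TIER_CHUNK;

	for (int i = 0; i < ntiers; i++)
		if (tiers[i].perf_level == perf_level)
			return &tiers[i];
	if (ntiers == MAX_TIERS)
		return NULL;
	tiers[ntiers].perf_level = perf_level;
	tiers[ntiers].nodes = 0;
	return &tiers[ntiers++];
}

/* Node came online: place it in a tier unless it already has one. */
static int node_online(int node, unsigned int perf_level)
{
	struct tier *t;

	assert(node >= 0 && node < (int)(sizeof(unsigned long) * CHAR_BIT));
	t = node_get_tier(node);
	if (!t) {
		t = find_create_tier(perf_level);
		if (!t)
			return -1;
		t->nodes |= 1UL << node;
	}
	return 0;
}

/* Node went offline: drop it from whichever tier holds it. */
static void node_offline(int node)
{
	struct tier *t = node_get_tier(node);

	if (t)
		t->nodes &= ~(1UL << node);
}
```

Calling node_online(0, 0) places node 0 in the 512 tier, while
node_online(1, 128) creates a separate lower tier for node 1. A later
node_offline(1) removes only node 1 and leaves node 0 where it was.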

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  1 +
 mm/memory-tiers.c            | 75 ++++++++++++++++++++++++++++++++++++
 2 files changed, 76 insertions(+)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index ef380a39db3a..3d5f14d57ae6 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -14,6 +14,7 @@
 #define MEMTIER_PERF_LEVEL_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
 /* leave one tier below this slow pmem */
 #define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)
+#define MEMTIER_HOTPLUG_PRIO	100
 
 extern bool numa_demotion_enabled;
 
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 41a21cc5ae55..cc3a47ec18e4 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -5,6 +5,7 @@
 #include <linux/lockdep.h>
 #include <linux/moduleparam.h>
 #include <linux/node.h>
+#include <linux/memory.h>
 #include <linux/memory-tiers.h>
 
 struct memory_tier {
@@ -64,6 +65,78 @@ static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
 	return new_memtier;
 }
 
+static struct memory_tier *__node_get_memory_tier(int node)
+{
+	struct memory_tier *memtier;
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (node_isset(node, memtier->nodelist))
+			return memtier;
+	}
+	return NULL;
+}
+
+static void init_node_memory_tier(int node)
+{
+	int perf_level;
+	struct memory_tier *memtier;
+
+	mutex_lock(&memory_tier_lock);
+
+	memtier = __node_get_memory_tier(node);
+	if (!memtier) {
+		perf_level = node_devices[node]->perf_level;
+		memtier = find_create_memory_tier(perf_level);
+		node_set(node, memtier->nodelist);
+	}
+	mutex_unlock(&memory_tier_lock);
+}
+
+static void clear_node_memory_tier(int node)
+{
+	struct memory_tier *memtier;
+
+	mutex_lock(&memory_tier_lock);
+	memtier = __node_get_memory_tier(node);
+	if (memtier)
+		node_clear(node, memtier->nodelist);
+	mutex_unlock(&memory_tier_lock);
+}
+
+/*
+ * This runs whether reclaim-based migration is enabled or not,
+ * which ensures that the user can turn on reclaim-based migration
+ * at any time without needing to recalculate migration targets.
+ */
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+						 unsigned long action, void *_arg)
+{
+	struct memory_notify *arg = _arg;
+
+	/*
+	 * Only update the node migration order when a node is
+	 * changing status, like online->offline.
+	 */
+	if (arg->status_change_nid < 0)
+		return notifier_from_errno(0);
+
+	switch (action) {
+	case MEM_OFFLINE:
+		clear_node_memory_tier(arg->status_change_nid);
+		break;
+	case MEM_ONLINE:
+		init_node_memory_tier(arg->status_change_nid);
+		break;
+	}
+
+	return notifier_from_errno(0);
+}
+
+static void __init migrate_on_reclaim_init(void)
+{
+	hotplug_memory_notifier(migrate_on_reclaim_callback, MEMTIER_HOTPLUG_PRIO);
+}
+
 static int __init memory_tier_init(void)
 {
 	int node;
@@ -96,6 +169,8 @@ static int __init memory_tier_init(void)
 			node_property->perf_level = default_memtier_perf_level;
 	}
 	mutex_unlock(&memory_tier_lock);
+
+	migrate_on_reclaim_init();
 	return 0;
 }
 subsys_initcall(memory_tier_init);
-- 
2.36.1



* [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM
  2022-07-20  2:59 [PATCH v10 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2022-07-20  2:59 ` [PATCH v10 3/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined Aneesh Kumar K.V
@ 2022-07-20  2:59 ` Aneesh Kumar K.V
  2022-07-21  6:07   ` kernel test robot
  2022-07-25  6:37   ` Huang, Ying
  2022-07-20  2:59 ` [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-20  2:59 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

By default, all nodes are assigned to the default memory tier, which
is the memory tier designated for nodes with DRAM.

Move the dax kmem device node to a slower memory tier by setting its
performance level to MEMTIER_PERF_LEVEL_PMEM. The PMEM tier
appears below the default memory tier in demotion order.
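
The drivers below register their hotplug notifier at MEMTIER_HOTPLUG_PRIO + 1.
Kernel notifier chains call higher-priority entries first, so the driver
callback can set the node's performance level before the memory tier callback
from patch 3/8 reads it. The userspace sketch below models only that ordering;
the struct, the callbacks and the qsort()-based chain are stand-ins invented
for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define MEMTIER_HOTPLUG_PRIO	100	/* value from patch 3/8 */
#define PERF_LEVEL_DRAM		512
#define PERF_LEVEL_PMEM		128

static unsigned int node_perf_level;	/* 0 = not set yet */
static unsigned int tier_seen_level;	/* what the tier code observed */

struct notifier {
	int priority;
	void (*fn)(void);
};

/* Stand-in for the driver callback: tag the node as PMEM. */
static void driver_cb(void)
{
	node_perf_level = PERF_LEVEL_PMEM;
}

/* Stand-in for the memory tier callback: read the level, defaulting to DRAM. */
static void memtier_cb(void)
{
	tier_seen_level = node_perf_level ? node_perf_level : PERF_LEVEL_DRAM;
}

/* qsort() comparator: higher priority sorts first. */
static int by_prio_desc(const void *a, const void *b)
{
	const struct notifier *x = a, *y = b;

	return (y->priority > x->priority) - (y->priority < x->priority);
}

/* Run callbacks highest priority first, mirroring notifier chain order. */
static void run_chain(struct notifier *chain, size_t n)
{
	qsort(chain, n, sizeof(*chain), by_prio_desc);
	for (size_t i = 0; i < n; i++) {
		assert(chain[i].fn);
		chain[i].fn();
	}
}
```

With the driver at priority 101 the tier code observes level 128. If the
driver were registered below priority 100, the tier code would run first,
see an unset level and fall back to the DRAM default.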

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/platforms/pseries/papr_scm.c | 41 ++++++++++++++++++++---
 drivers/acpi/nfit/core.c                  | 41 ++++++++++++++++++++++-
 2 files changed, 76 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 82cae08976bc..3b6164418d6f 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -14,6 +14,8 @@
 #include <linux/delay.h>
 #include <linux/seq_buf.h>
 #include <linux/nd.h>
+#include <linux/memory.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/plpar_wrappers.h>
 #include <asm/papr_pdsm.h>
@@ -98,6 +100,7 @@ struct papr_scm_priv {
 	bool hcall_flush_required;
 
 	uint64_t bound_addr;
+	int target_node;
 
 	struct nvdimm_bus_descriptor bus_desc;
 	struct nvdimm_bus *bus;
@@ -1278,6 +1281,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 	p->bus_desc.module = THIS_MODULE;
 	p->bus_desc.of_node = p->pdev->dev.of_node;
 	p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
+	p->target_node = dev_to_node(&p->pdev->dev);
 
 	/* Set the dimm command family mask to accept PDSMs */
 	set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask);
@@ -1322,7 +1326,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 	mapping.size = p->blocks * p->block_size; // XXX: potential overflow?
 
 	memset(&ndr_desc, 0, sizeof(ndr_desc));
-	target_nid = dev_to_node(&p->pdev->dev);
+	target_nid = p->target_node;
 	online_nid = numa_map_to_online_node(target_nid);
 	ndr_desc.numa_node = online_nid;
 	ndr_desc.target_node = target_nid;
@@ -1582,15 +1586,42 @@ static struct platform_driver papr_scm_driver = {
 	},
 };
 
+static int papr_scm_callback(struct notifier_block *self,
+			     unsigned long action, void *arg)
+{
+	struct memory_notify *mnb = arg;
+	int nid = mnb->status_change_nid;
+	struct papr_scm_priv *p;
+
+	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
+		return NOTIFY_OK;
+
+	mutex_lock(&papr_ndr_lock);
+	list_for_each_entry(p, &papr_nd_regions, region_list) {
+		if (p->target_node == nid) {
+			node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
+			break;
+		}
+	}
+
+	mutex_unlock(&papr_ndr_lock);
+	return NOTIFY_OK;
+}
+
 static int __init papr_scm_init(void)
 {
 	int ret;
 
 	ret = platform_driver_register(&papr_scm_driver);
-	if (!ret)
-		mce_register_notifier(&mce_ue_nb);
-
-	return ret;
+	if (ret)
+		return ret;
+	mce_register_notifier(&mce_ue_nb);
+	/*
+	 * Register a memory hotplug notifier one priority above
+	 * MEMTIER_HOTPLUG_PRIO so that we can update the node's perf level.
+	 */
+	hotplug_memory_notifier(papr_scm_callback, MEMTIER_HOTPLUG_PRIO + 1);
+	return 0;
 }
 module_init(papr_scm_init);
 
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index ae5f4acf2675..7ea1017ef790 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -15,6 +15,8 @@
 #include <linux/sort.h>
 #include <linux/io.h>
 #include <linux/nd.h>
+#include <linux/memory.h>
+#include <linux/memory-tiers.h>
 #include <asm/cacheflush.h>
 #include <acpi/nfit.h>
 #include "intel.h"
@@ -3470,6 +3472,39 @@ static struct acpi_driver acpi_nfit_driver = {
 	},
 };
 
+static int nfit_callback(struct notifier_block *self,
+			 unsigned long action, void *arg)
+{
+	bool found = false;
+	struct memory_notify *mnb = arg;
+	int nid = mnb->status_change_nid;
+	struct nfit_spa *nfit_spa;
+	struct acpi_nfit_desc *acpi_desc;
+
+	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
+		return NOTIFY_OK;
+
+	mutex_lock(&acpi_desc_lock);
+	list_for_each_entry(acpi_desc, &acpi_descs, list) {
+		mutex_lock(&acpi_desc->init_mutex);
+		list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
+			struct acpi_nfit_system_address *spa = nfit_spa->spa;
+			int target_node = pxm_to_node(spa->proximity_domain);
+
+			if (target_node == nid) {
+				node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
+				found = true;
+				break;
+			}
+		}
+		mutex_unlock(&acpi_desc->init_mutex);
+		if (found)
+			break;
+	}
+	mutex_unlock(&acpi_desc_lock);
+	return NOTIFY_OK;
+}
+
 static __init int nfit_init(void)
 {
 	int ret;
@@ -3509,7 +3544,11 @@ static __init int nfit_init(void)
 		nfit_mce_unregister();
 		destroy_workqueue(nfit_wq);
 	}
-
+	/*
+	 * Register a memory hotplug notifier above MEMTIER_HOTPLUG_PRIO so
+	 * that the node's perf level is updated before memory tiers use it.
+	 */
+	hotplug_memory_notifier(nfit_callback, MEMTIER_HOTPLUG_PRIO + 1);
 	return ret;
 
 }
-- 
2.36.1



* [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-20  2:59 [PATCH v10 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2022-07-20  2:59 ` [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM Aneesh Kumar K.V
@ 2022-07-20  2:59 ` Aneesh Kumar K.V
  2022-07-20  3:38   ` Aneesh Kumar K.V
                     ` (2 more replies)
  2022-07-20  2:59 ` [PATCH v10 6/8] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-20  2:59 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

This patch switches the demotion target building logic to use memory tiers
instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
default memory tier, and additional memory tiers will be added by drivers like
dax kmem.

This patch builds the demotion target for a NUMA node by looking at all
memory tiers below the tier to which the NUMA node belongs. The closest node
in the immediately following memory tier is used as a demotion target.
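
As a rough illustration of the selection rule above, here is a user-space
sketch, not the kernel code in this patch. The names preferred_targets(),
nodemask and MAX_NODES are made up for the example; node masks are plain
bitmasks, and tiers[] is assumed to hold only populated tiers, ordered from
the highest tier to the lowest.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 8
#define NO_TIER   (-1)

/* Hypothetical stand-in for nodemask_t: bit n set means node n is a member. */
typedef uint32_t nodemask;

/*
 * Return the nodes of the tier right below the tier of 'node' that share
 * the smallest distance from 'node'. Returns 0 if 'node' is in the last
 * tier or in no tier at all.
 */
static nodemask preferred_targets(int node, const nodemask *tiers, int nr_tiers,
				  int nr_nodes, const int dist[][MAX_NODES])
{
	nodemask preferred = 0;
	int best = -1;
	int tier = NO_TIER;

	assert(nr_nodes <= MAX_NODES && node >= 0 && node < nr_nodes);

	/* Locate the tier that contains 'node'. */
	for (int t = 0; t < nr_tiers; t++)
		if (tiers[t] & (1u << node))
			tier = t;
	if (tier == NO_TIER || tier == nr_tiers - 1)
		return 0;

	/* First pass: smallest distance into the immediately following tier. */
	for (int n = 0; n < nr_nodes; n++)
		if ((tiers[tier + 1] & (1u << n)) &&
		    (best == -1 || dist[node][n] < best))
			best = dist[node][n];

	/* Second pass: every node at that distance is a preferred target. */
	for (int n = 0; n < nr_nodes; n++)
		if ((tiers[tier + 1] & (1u << n)) && dist[node][n] == best)
			preferred |= 1u << n;

	return preferred;
}
```

With the distance table from Example 1 in this patch's node_demotion[]
comment (tiers 0-1 and 2-3), the sketch selects node 2 for node 0 and node 3
for node 1. Nodes at equal best distance all end up in the returned mask.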

Since we are now only building demotion targets for N_MEMORY NUMA nodes,
the CPU hotplug calls are removed in this patch.

A new memory tier can be inserted into the tier hierarchy for a new set of nodes
without affecting the node assignment of any existing memtier, provided that
there is enough gap in the performance level values for the new memtier.

The absolute value of performance level of a memtier doesn't necessarily carry
any meaning. Its value relative to other memtiers decides the level of this
memtier in the tier hierarchy.
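
A minimal sketch of that property, again in user space and with
hypothetical names (struct tier, struct tier_list, find_create_tier()): the
tiers are kept sorted by perf_level, so a new tier slots in between existing
ones without touching their node sets. It assumes that a larger perf_level
sorts earlier in the list, which this message does not state.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TIERS 8

/* Hypothetical model of a memory tier: a perf level plus a node bitmask. */
struct tier {
	int perf_level;
	unsigned int nodes;
};

struct tier_list {
	struct tier tiers[MAX_TIERS];	/* sorted by descending perf_level */
	size_t nr;
};

/*
 * Return the tier with this perf_level, creating it at its sorted
 * position if it does not exist yet. Returns NULL when the list is full.
 */
static struct tier *find_create_tier(struct tier_list *tl, int perf_level)
{
	size_t pos = 0;

	assert(tl != NULL);

	while (pos < tl->nr && tl->tiers[pos].perf_level > perf_level)
		pos++;
	if (pos < tl->nr && tl->tiers[pos].perf_level == perf_level)
		return &tl->tiers[pos];
	if (tl->nr == MAX_TIERS)
		return NULL;

	/* Shift the lower tiers down by one slot; their node sets are unchanged. */
	for (size_t i = tl->nr; i > pos; i--)
		tl->tiers[i] = tl->tiers[i - 1];
	tl->tiers[pos] = (struct tier){ .perf_level = perf_level, .nodes = 0 };
	tl->nr++;
	return &tl->tiers[pos];
}
```

Only the ordering matters here: inserting a tier with level 150 between
tiers with levels 200 and 100 changes the position of the lower tier in the
list but not which nodes belong to it.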

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  12 ++
 include/linux/migrate.h      |  13 --
 mm/memory-tiers.c            | 218 ++++++++++++++++++-
 mm/migrate.c                 | 394 -----------------------------------
 mm/vmstat.c                  |   4 -
 5 files changed, 229 insertions(+), 412 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 3d5f14d57ae6..852e86bd0a23 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -17,9 +17,21 @@
 #define MEMTIER_HOTPLUG_PRIO	100
 
 extern bool numa_demotion_enabled;
+#ifdef CONFIG_MIGRATION
+int next_demotion_node(int node);
+#else
+static inline int next_demotion_node(int node)
+{
+	return NUMA_NO_NODE;
+}
+#endif
 
 #else
 
 #define numa_demotion_enabled	false
+static inline int next_demotion_node(int node)
+{
+	return NUMA_NO_NODE;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 43e737215f33..93fab62e6548 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 
 #endif /* CONFIG_MIGRATION */
 
-#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
-extern void set_migration_target_nodes(void);
-extern void migrate_on_reclaim_init(void);
-extern int next_demotion_node(int node);
-#else
-static inline void set_migration_target_nodes(void) {}
-static inline void migrate_on_reclaim_init(void) {}
-static inline int next_demotion_node(int node)
-{
-        return NUMA_NO_NODE;
-}
-#endif
-
 #ifdef CONFIG_COMPACTION
 extern int PageMovable(struct page *page);
 extern void __SetPageMovable(struct page *page, struct address_space *mapping);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index cc3a47ec18e4..a8cfe2ca3903 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -6,17 +6,88 @@
 #include <linux/moduleparam.h>
 #include <linux/node.h>
 #include <linux/memory.h>
+#include <linux/random.h>
 #include <linux/memory-tiers.h>
 
+#include "internal.h"
+
 struct memory_tier {
 	struct list_head list;
 	int perf_level;
 	nodemask_t nodelist;
 };
 
+struct demotion_nodes {
+	nodemask_t preferred;
+};
+
 static LIST_HEAD(memory_tiers);
 static DEFINE_MUTEX(memory_tier_lock);
 
+#ifdef CONFIG_MIGRATION
+/*
+ * node_demotion[] examples:
+ *
+ * Example 1:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
+ *
+ * node distances:
+ * node   0    1    2    3
+ *    0  10   20   30   40
+ *    1  20   10   40   30
+ *    2  30   40   10   40
+ *    3  40   30   40   10
+ *
+ * memory_tiers[0] = <empty>
+ * memory_tiers[1] = 0-1
+ * memory_tiers[2] = 2-3
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 3
+ * node_demotion[2].preferred = <empty>
+ * node_demotion[3].preferred = <empty>
+ *
+ * Example 2:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 is a memory-only DRAM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   30
+ *    2  30   30   10
+ *
+ * memory_tiers[0] = <empty>
+ * memory_tiers[1] = 0-2
+ * memory_tiers[2] = <empty>
+ *
+ * node_demotion[0].preferred = <empty>
+ * node_demotion[1].preferred = <empty>
+ * node_demotion[2].preferred = <empty>
+ *
+ * Example 3:
+ *
+ * Node 0 is a CPU + DRAM node, node 1 is an HBM node, node 2 is a PMEM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   40
+ *    2  30   40   10
+ *
+ * memory_tiers[0] = 1
+ * memory_tiers[1] = 0
+ * memory_tiers[2] = 2
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 0
+ * node_demotion[2].preferred = <empty>
+ *
+ */
+static struct demotion_nodes *node_demotion __read_mostly;
+#endif /* CONFIG_MIGRATION */
+
 /*
  * For now let's have 4 memory tier below default DRAM tier.
  */
@@ -76,6 +147,136 @@ static struct memory_tier *__node_get_memory_tier(int node)
 	return NULL;
 }
 
+#ifdef CONFIG_MIGRATION
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * Return: node id for next memory node in the demotion path hierarchy
+ * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
+ * @node online or guarantee that it *continues* to be the next demotion
+ * target.
+ */
+int next_demotion_node(int node)
+{
+	struct demotion_nodes *nd;
+	int target;
+
+	if (!node_demotion)
+		return NUMA_NO_NODE;
+
+	nd = &node_demotion[node];
+
+	/*
+	 * node_demotion[] is updated without excluding this
+	 * function from running.
+	 *
+	 * Make sure to use RCU over entire code blocks if
+	 * node_demotion[] reads need to be consistent.
+	 */
+	rcu_read_lock();
+	/*
+	 * If there are multiple target nodes, just select one
+	 * target node randomly.
+	 *
+	 * In addition, we can also use round-robin to select
+	 * target node, but we should introduce another variable
+	 * for node_demotion[] to record last selected target node,
+	 * that may cause cache ping-pong due to the changing of
+	 * last target node. Or introducing per-cpu data to avoid
+	 * caching issue, which seems more complicated. So selecting
+	 * target node randomly seems better until now.
+	 */
+	target = node_random(&nd->preferred);
+	rcu_read_unlock();
+
+	return target;
+}
+
+/* Disable reclaim-based migration. */
+static void __disable_all_migrate_targets(void)
+{
+	int node;
+
+	for_each_node_state(node, N_MEMORY)
+		node_demotion[node].preferred = NODE_MASK_NONE;
+}
+
+static void disable_all_migrate_targets(void)
+{
+	__disable_all_migrate_targets();
+
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 */
+	synchronize_rcu();
+}
+/*
+ * Find an automatic demotion target for all memory
+ * nodes. Failing here is OK.  It might just indicate
+ * being at the end of a chain.
+ */
+static void establish_migration_targets(void)
+{
+	struct memory_tier *memtier;
+	struct demotion_nodes *nd;
+	int target = NUMA_NO_NODE, node;
+	int distance, best_distance;
+	nodemask_t used;
+
+	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
+		return;
+
+	disable_all_migrate_targets();
+
+	for_each_node_state(node, N_MEMORY) {
+		best_distance = -1;
+		nd = &node_demotion[node];
+
+		memtier = __node_get_memory_tier(node);
+		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
+			continue;
+		/*
+		 * Get the next memtier to find the demotion node list.
+		 */
+		memtier = list_next_entry(memtier, list);
+
+		/*
+		 * find_next_best_node() uses the 'used' nodemask as a skip list.
+		 * Add all memory nodes except those in the selected memory tier's
+		 * nodelist to the skip list so that we find the best node from
+		 * the memtier nodelist.
+		 */
+		nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);
+
+		/*
+		 * Find all the nodes in the memory tier node list that share the
+		 * best distance and add them to the preferred mask. We randomly
+		 * select between them when allocating pages during demotion.
+		 */
+		do {
+			target = find_next_best_node(node, &used);
+			if (target == NUMA_NO_NODE)
+				break;
+
+			distance = node_distance(node, target);
+			if (distance == best_distance || best_distance == -1) {
+				best_distance = distance;
+				node_set(target, nd->preferred);
+			} else {
+				break;
+			}
+		} while (1);
+	}
+}
+#else
+static inline void disable_all_migrate_targets(void) {}
+static inline void establish_migration_targets(void) {}
+#endif /* CONFIG_MIGRATION */
+
 static void init_node_memory_tier(int node)
 {
 	int perf_level;
@@ -84,11 +285,19 @@ static void init_node_memory_tier(int node)
 	mutex_lock(&memory_tier_lock);
 
 	memtier = __node_get_memory_tier(node);
+	/*
+	 * If the node is already part of a tier, proceed with the
+	 * current tier value, because we might want to establish
+	 * new migration paths now. The node might have been added to a
+	 * tier before it became part of N_MEMORY, in which case
+	 * establish_migration_targets() will have skipped this node.
+	 */
 	if (!memtier) {
 		perf_level = node_devices[node]->perf_level;
 		memtier = find_create_memory_tier(perf_level);
 		node_set(node, memtier->nodelist);
 	}
+	establish_migration_targets();
 	mutex_unlock(&memory_tier_lock);
 }
 
@@ -98,8 +307,10 @@ static void clear_node_memory_tier(int node)
 
 	mutex_lock(&memory_tier_lock);
 	memtier = __node_get_memory_tier(node);
-	if (memtier)
+	if (memtier) {
 		node_clear(node, memtier->nodelist);
+		establish_migration_targets();
+	}
 	mutex_unlock(&memory_tier_lock);
 }
 
@@ -134,6 +345,11 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
 
 static void __init migrate_on_reclaim_init(void)
 {
+	if (IS_ENABLED(CONFIG_MIGRATION)) {
+		node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
+					GFP_KERNEL);
+		WARN_ON(!node_demotion);
+	}
 	hotplug_memory_notifier(migrate_on_reclaim_callback, MEMTIER_HOTPLUG_PRIO);
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index fce7d4a9e940..c758c9c21d7d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2117,398 +2117,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	return 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
-
-/*
- * node_demotion[] example:
- *
- * Consider a system with two sockets.  Each socket has
- * three classes of memory attached: fast, medium and slow.
- * Each memory class is placed in its own NUMA node.  The
- * CPUs are placed in the node with the "fast" memory.  The
- * 6 NUMA nodes (0-5) might be split among the sockets like
- * this:
- *
- *	Socket A: 0, 1, 2
- *	Socket B: 3, 4, 5
- *
- * When Node 0 fills up, its memory should be migrated to
- * Node 1.  When Node 1 fills up, it should be migrated to
- * Node 2.  The migration path start on the nodes with the
- * processors (since allocations default to this node) and
- * fast memory, progress through medium and end with the
- * slow memory:
- *
- *	0 -> 1 -> 2 -> stop
- *	3 -> 4 -> 5 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *
- *	{  nr=1, nodes[0]=1 }, // Node 0 migrates to 1
- *	{  nr=1, nodes[0]=2 }, // Node 1 migrates to 2
- *	{  nr=0, nodes[0]=-1 }, // Node 2 does not migrate
- *	{  nr=1, nodes[0]=4 }, // Node 3 migrates to 4
- *	{  nr=1, nodes[0]=5 }, // Node 4 migrates to 5
- *	{  nr=0, nodes[0]=-1 }, // Node 5 does not migrate
- *
- * Moreover some systems may have multiple slow memory nodes.
- * Suppose a system has one socket with 3 memory nodes, node 0
- * is fast memory type, and node 1/2 both are slow memory
- * type, and the distance between fast memory node and slow
- * memory node is same. So the migration path should be:
- *
- *	0 -> 1/2 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
- *	{ nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
- *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
- */
-
-/*
- * Writes to this array occur without locking.  Cycles are
- * not allowed: Node X demotes to Y which demotes to X...
- *
- * If multiple reads are performed, a single rcu_read_lock()
- * must be held over all reads to ensure that no cycles are
- * observed.
- */
-#define DEFAULT_DEMOTION_TARGET_NODES 15
-
-#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
-#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
-#else
-#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
-#endif
-
-struct demotion_nodes {
-	unsigned short nr;
-	short nodes[DEMOTION_TARGET_NODES];
-};
-
-static struct demotion_nodes *node_demotion __read_mostly;
-
-/**
- * next_demotion_node() - Get the next node in the demotion path
- * @node: The starting node to lookup the next node
- *
- * Return: node id for next memory node in the demotion path hierarchy
- * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
- * @node online or guarantee that it *continues* to be the next demotion
- * target.
- */
-int next_demotion_node(int node)
-{
-	struct demotion_nodes *nd;
-	unsigned short target_nr, index;
-	int target;
-
-	if (!node_demotion)
-		return NUMA_NO_NODE;
-
-	nd = &node_demotion[node];
-
-	/*
-	 * node_demotion[] is updated without excluding this
-	 * function from running.  RCU doesn't provide any
-	 * compiler barriers, so the READ_ONCE() is required
-	 * to avoid compiler reordering or read merging.
-	 *
-	 * Make sure to use RCU over entire code blocks if
-	 * node_demotion[] reads need to be consistent.
-	 */
-	rcu_read_lock();
-	target_nr = READ_ONCE(nd->nr);
-
-	switch (target_nr) {
-	case 0:
-		target = NUMA_NO_NODE;
-		goto out;
-	case 1:
-		index = 0;
-		break;
-	default:
-		/*
-		 * If there are multiple target nodes, just select one
-		 * target node randomly.
-		 *
-		 * In addition, we can also use round-robin to select
-		 * target node, but we should introduce another variable
-		 * for node_demotion[] to record last selected target node,
-		 * that may cause cache ping-pong due to the changing of
-		 * last target node. Or introducing per-cpu data to avoid
-		 * caching issue, which seems more complicated. So selecting
-		 * target node randomly seems better until now.
-		 */
-		index = get_random_int() % target_nr;
-		break;
-	}
-
-	target = READ_ONCE(nd->nodes[index]);
-
-out:
-	rcu_read_unlock();
-	return target;
-}
-
-/* Disable reclaim-based migration. */
-static void __disable_all_migrate_targets(void)
-{
-	int node, i;
-
-	if (!node_demotion)
-		return;
-
-	for_each_online_node(node) {
-		node_demotion[node].nr = 0;
-		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
-			node_demotion[node].nodes[i] = NUMA_NO_NODE;
-	}
-}
-
-static void disable_all_migrate_targets(void)
-{
-	__disable_all_migrate_targets();
-
-	/*
-	 * Ensure that the "disable" is visible across the system.
-	 * Readers will see either a combination of before+disable
-	 * state or disable+after.  They will never see before and
-	 * after state together.
-	 *
-	 * The before+after state together might have cycles and
-	 * could cause readers to do things like loop until this
-	 * function finishes.  This ensures they can only see a
-	 * single "bad" read and would, for instance, only loop
-	 * once.
-	 */
-	synchronize_rcu();
-}
-
-/*
- * Find an automatic demotion target for 'node'.
- * Failing here is OK.  It might just indicate
- * being at the end of a chain.
- */
-static int establish_migrate_target(int node, nodemask_t *used,
-				    int best_distance)
-{
-	int migration_target, index, val;
-	struct demotion_nodes *nd;
-
-	if (!node_demotion)
-		return NUMA_NO_NODE;
-
-	nd = &node_demotion[node];
-
-	migration_target = find_next_best_node(node, used);
-	if (migration_target == NUMA_NO_NODE)
-		return NUMA_NO_NODE;
-
-	/*
-	 * If the node has been set a migration target node before,
-	 * which means it's the best distance between them. Still
-	 * check if this node can be demoted to other target nodes
-	 * if they have a same best distance.
-	 */
-	if (best_distance != -1) {
-		val = node_distance(node, migration_target);
-		if (val > best_distance)
-			goto out_clear;
-	}
-
-	index = nd->nr;
-	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
-		      "Exceeds maximum demotion target nodes\n"))
-		goto out_clear;
-
-	nd->nodes[index] = migration_target;
-	nd->nr++;
-
-	return migration_target;
-out_clear:
-	node_clear(migration_target, *used);
-	return NUMA_NO_NODE;
-}
-
-/*
- * When memory fills up on a node, memory contents can be
- * automatically migrated to another node instead of
- * discarded at reclaim.
- *
- * Establish a "migration path" which will start at nodes
- * with CPUs and will follow the priorities used to build the
- * page allocator zonelists.
- *
- * The difference here is that cycles must be avoided.  If
- * node0 migrates to node1, then neither node1, nor anything
- * node1 migrates to can migrate to node0. Also one node can
- * be migrated to multiple nodes if the target nodes all have
- * a same best-distance against the source node.
- *
- * This function can run simultaneously with readers of
- * node_demotion[].  However, it can not run simultaneously
- * with itself.  Exclusion is provided by memory hotplug events
- * being single-threaded.
- */
-static void __set_migration_target_nodes(void)
-{
-	nodemask_t next_pass;
-	nodemask_t this_pass;
-	nodemask_t used_targets = NODE_MASK_NONE;
-	int node, best_distance;
-
-	/*
-	 * Avoid any oddities like cycles that could occur
-	 * from changes in the topology.  This will leave
-	 * a momentary gap when migration is disabled.
-	 */
-	disable_all_migrate_targets();
-
-	/*
-	 * Allocations go close to CPUs, first.  Assume that
-	 * the migration path starts at the nodes with CPUs.
-	 */
-	next_pass = node_states[N_CPU];
-again:
-	this_pass = next_pass;
-	next_pass = NODE_MASK_NONE;
-	/*
-	 * To avoid cycles in the migration "graph", ensure
-	 * that migration sources are not future targets by
-	 * setting them in 'used_targets'.  Do this only
-	 * once per pass so that multiple source nodes can
-	 * share a target node.
-	 *
-	 * 'used_targets' will become unavailable in future
-	 * passes.  This limits some opportunities for
-	 * multiple source nodes to share a destination.
-	 */
-	nodes_or(used_targets, used_targets, this_pass);
-
-	for_each_node_mask(node, this_pass) {
-		best_distance = -1;
-
-		/*
-		 * Try to set up the migration path for the node, and the target
-		 * migration nodes can be multiple, so doing a loop to find all
-		 * the target nodes if they all have a best node distance.
-		 */
-		do {
-			int target_node =
-				establish_migrate_target(node, &used_targets,
-							 best_distance);
-
-			if (target_node == NUMA_NO_NODE)
-				break;
-
-			if (best_distance == -1)
-				best_distance = node_distance(node, target_node);
-
-			/*
-			 * Visit targets from this pass in the next pass.
-			 * Eventually, every node will have been part of
-			 * a pass, and will become set in 'used_targets'.
-			 */
-			node_set(target_node, next_pass);
-		} while (1);
-	}
-	/*
-	 * 'next_pass' contains nodes which became migration
-	 * targets in this pass.  Make additional passes until
-	 * no more migrations targets are available.
-	 */
-	if (!nodes_empty(next_pass))
-		goto again;
-}
-
-/*
- * For callers that do not hold get_online_mems() already.
- */
-void set_migration_target_nodes(void)
-{
-	get_online_mems();
-	__set_migration_target_nodes();
-	put_online_mems();
-}
-
-/*
- * This leaves migrate-on-reclaim transiently disabled between
- * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
- * whether reclaim-based migration is enabled or not, which
- * ensures that the user can turn reclaim-based migration at
- * any time without needing to recalculate migration targets.
- *
- * These callbacks already hold get_online_mems().  That is why
- * __set_migration_target_nodes() can be used as opposed to
- * set_migration_target_nodes().
- */
-#ifdef CONFIG_MEMORY_HOTPLUG
-static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
-						 unsigned long action, void *_arg)
-{
-	struct memory_notify *arg = _arg;
-
-	/*
-	 * Only update the node migration order when a node is
-	 * changing status, like online->offline.  This avoids
-	 * the overhead of synchronize_rcu() in most cases.
-	 */
-	if (arg->status_change_nid < 0)
-		return notifier_from_errno(0);
-
-	switch (action) {
-	case MEM_GOING_OFFLINE:
-		/*
-		 * Make sure there are not transient states where
-		 * an offline node is a migration target.  This
-		 * will leave migration disabled until the offline
-		 * completes and the MEM_OFFLINE case below runs.
-		 */
-		disable_all_migrate_targets();
-		break;
-	case MEM_OFFLINE:
-	case MEM_ONLINE:
-		/*
-		 * Recalculate the target nodes once the node
-		 * reaches its final state (online or offline).
-		 */
-		__set_migration_target_nodes();
-		break;
-	case MEM_CANCEL_OFFLINE:
-		/*
-		 * MEM_GOING_OFFLINE disabled all the migration
-		 * targets.  Reenable them.
-		 */
-		__set_migration_target_nodes();
-		break;
-	case MEM_GOING_ONLINE:
-	case MEM_CANCEL_ONLINE:
-		break;
-	}
-
-	return notifier_from_errno(0);
-}
-#endif
-
-void __init migrate_on_reclaim_init(void)
-{
-	node_demotion = kcalloc(nr_node_ids,
-				sizeof(struct demotion_nodes),
-				GFP_KERNEL);
-	WARN_ON(!node_demotion);
-#ifdef CONFIG_MEMORY_HOTPLUG
-	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
-#endif
-	/*
-	 * At this point, all numa nodes with memory/CPus have their state
-	 * properly set, so we can build the demotion order now.
-	 * Let us hold the cpu_hotplug lock just, as we could possibily have
-	 * CPU hotplug events during boot.
-	 */
-	cpus_read_lock();
-	set_migration_target_nodes();
-	cpus_read_unlock();
-}
 #endif /* CONFIG_NUMA */
-
-
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 373d2730fcf2..35c6ff97cf29 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -28,7 +28,6 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
-#include <linux/migrate.h>
 
 #include "internal.h"
 
@@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu)
 
 	if (!node_state(cpu_to_node(cpu), N_CPU)) {
 		node_set_state(cpu_to_node(cpu), N_CPU);
-		set_migration_target_nodes();
 	}
 
 	return 0;
@@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu)
 		return 0;
 
 	node_clear_state(node, N_CPU);
-	set_migration_target_nodes();
 
 	return 0;
 }
@@ -2118,7 +2115,6 @@ void __init init_mm_internals(void)
 
 	start_shepherd_timer();
 #endif
-	migrate_on_reclaim_init();
 #ifdef CONFIG_PROC_FS
 	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
 	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
-- 
2.36.1



* [PATCH v10 6/8] mm/demotion: Add pg_data_t member to track node memory tier details
  2022-07-20  2:59 [PATCH v10 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2022-07-20  2:59 ` [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
@ 2022-07-20  2:59 ` Aneesh Kumar K.V
  2022-07-26  8:02   ` Huang, Ying
  2022-07-20  2:59 ` [PATCH v10 7/8] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
  2022-07-20  2:59 ` [PATCH v10 8/8] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
  7 siblings, 1 reply; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-20  2:59 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

Also update different helpers to use NODE_DATA()->memtier. Since the
node-specific memtier can change when a NUMA node is reassigned to a
different memory tier, accessing NODE_DATA()->memtier needs to happen
under an RCU read lock or memory_tier_lock.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/mmzone.h |  3 ++
 mm/memory-tiers.c      | 65 +++++++++++++++++++++++++++++++++++-------
 2 files changed, 57 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aab70355d64f..353812495a70 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -928,6 +928,9 @@ typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
+#ifdef CONFIG_NUMA
+	struct memory_tier __rcu *memtier;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index a8cfe2ca3903..4715f9b96a44 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -138,13 +138,18 @@ static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
 
 static struct memory_tier *__node_get_memory_tier(int node)
 {
-	struct memory_tier *memtier;
+	pg_data_t *pgdat;
 
-	list_for_each_entry(memtier, &memory_tiers, list) {
-		if (node_isset(node, memtier->nodelist))
-			return memtier;
-	}
-	return NULL;
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return NULL;
+	/*
+	 * Since we hold memory_tier_lock, we can avoid
+	 * RCU read locks when accessing the details. No
+	 * parallel updates are possible here.
+	 */
+	return rcu_dereference_check(pgdat->memtier,
+				     lockdep_is_held(&memory_tier_lock));
 }
 
 #ifdef CONFIG_MIGRATION
@@ -277,6 +282,29 @@ static inline void disable_all_migrate_targets(void) {}
 static inline void establish_migration_targets(void) {}
 #endif /* CONFIG_MIGRATION */
 
+static void memtier_node_set(int node, struct memory_tier *memtier)
+{
+	pg_data_t *pgdat;
+	struct memory_tier *current_memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return;
+	/*
+	 * Make sure we mark the memtier NULL before we assign the new memory tier
+	 * to the NUMA node. This makes sure that anybody looking at NODE_DATA
+	 * finds a NULL memtier or the one which is still valid.
+	 */
+	current_memtier = rcu_dereference_check(pgdat->memtier,
+						lockdep_is_held(&memory_tier_lock));
+	rcu_assign_pointer(pgdat->memtier, NULL);
+	synchronize_rcu();
+	if (current_memtier)
+		node_clear(node, current_memtier->nodelist);
+	node_set(node, memtier->nodelist);
+	rcu_assign_pointer(pgdat->memtier, memtier);
+}
+
 static void init_node_memory_tier(int node)
 {
 	int perf_level;
@@ -295,7 +323,7 @@ static void init_node_memory_tier(int node)
 	if (!memtier) {
 		perf_level = node_devices[node]->perf_level;
 		memtier = find_create_memory_tier(perf_level);
-		node_set(node, memtier->nodelist);
+		memtier_node_set(node, memtier);
 	}
 	establish_migration_targets();
 	mutex_unlock(&memory_tier_lock);
@@ -303,12 +331,25 @@ static void init_node_memory_tier(int node)
 
 static void clear_node_memory_tier(int node)
 {
-	struct memory_tier *memtier;
+	pg_data_t *pgdat;
+	struct memory_tier *current_memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return;
 
 	mutex_lock(&memory_tier_lock);
-	memtier = __node_get_memory_tier(node);
-	if (memtier) {
-		node_clear(node, memtier->nodelist);
+	/*
+	 * Mark the memtier NULL and wait for a grace period before we remove
+	 * the node from the tier's nodelist. This makes sure that anybody looking
+	 * at NODE_DATA finds a NULL memtier or one which is still valid.
+	 */
+	current_memtier = rcu_dereference_check(pgdat->memtier,
+						lockdep_is_held(&memory_tier_lock));
+	rcu_assign_pointer(pgdat->memtier, NULL);
+	synchronize_rcu();
+	if (current_memtier) {
+		node_clear(node, current_memtier->nodelist);
 		establish_migration_targets();
 	}
 	mutex_unlock(&memory_tier_lock);
@@ -383,6 +424,8 @@ static int __init memory_tier_init(void)
 
 		if (!node_property->perf_level)
 			node_property->perf_level = default_memtier_perf_level;
+
+		rcu_assign_pointer(NODE_DATA(node)->memtier, memtier);
 	}
 	mutex_unlock(&memory_tier_lock);
 
-- 
2.36.1



* [PATCH v10 7/8] mm/demotion: Demote pages according to allocation fallback order
  2022-07-20  2:59 [PATCH v10 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2022-07-20  2:59 ` [PATCH v10 6/8] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
@ 2022-07-20  2:59 ` Aneesh Kumar K.V
  2022-07-26  8:24   ` Huang, Ying
  2022-07-20  2:59 ` [PATCH v10 8/8] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
  7 siblings, 1 reply; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-20  2:59 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya,
	Aneesh Kumar K . V

From: Jagdish Gediya <jvgediya.oss@gmail.com>

Currently, a higher tier node can only be demoted to selected
nodes on the next lower tier as defined by the demotion path.
This strict, hard-coded demotion order does not work in all
use cases (e.g. some use cases may want to allow cross-socket
demotion to another node in the same demotion tier as a fallback
when the preferred demotion node is out of space). This demotion
order is also inconsistent with the page allocation fallback order
when all the nodes in a higher tier are out of space: The page
allocation can fall back to any node from any lower tier, whereas
the demotion order doesn't allow that currently.

This patch adds support for getting all the allowed demotion targets
for a memory tier. The demote_page_list() function is modified to
use this allowed node mask as the fallback allocation mask.
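
The allocation order described above (preferred node first, then any
node in the allowed mask) can be sketched in user space as follows.
This is only an illustrative model under stated assumptions: the
helper names, the free_pages[] array and the NO_NODE constant are
hypothetical, and the real allocator falls back in zonelist order
rather than in ascending node number.

```c
#include <assert.h>
#include <stdint.h>

#define NO_NODE (-1)	/* hypothetical stand-in for NUMA_NO_NODE */

/* Hypothetical model: free_pages[n] is the free page count of node n. */
static int alloc_on_node(long *free_pages, int node)
{
	if (free_pages[node] <= 0)
		return NO_NODE;
	free_pages[node]--;
	return node;
}

/*
 * Returns the node the page was taken from, or NO_NODE.
 * allowed_mask has one bit per node, so nr_nodes must be <= 64.
 */
static int alloc_demote_target(long *free_pages, int nr_nodes,
			       int preferred, uint64_t allowed_mask)
{
	int node;

	/* Step 1: preferred node only (models __GFP_THISNODE, no nodemask). */
	node = alloc_on_node(free_pages, preferred);
	if (node != NO_NODE)
		return node;

	/* Step 2: retry on any node in the allowed fallback mask. */
	for (node = 0; node < nr_nodes; node++) {
		if (!(allowed_mask & (UINT64_C(1) << node)))
			continue;
		if (alloc_on_node(free_pages, node) != NO_NODE)
			return node;
	}
	return NO_NODE;
}
```

With free_pages = {0, 1, 0, 2}, preferred node 1 and allowed mask 0xe,
the first call is served by node 1 and later calls fall back to node 3
until it is empty as well.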

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h | 11 +++++++
 mm/memory-tiers.c            | 54 +++++++++++++++++++++++++++++++--
 mm/vmscan.c                  | 58 ++++++++++++++++++++++++++----------
 3 files changed, 106 insertions(+), 17 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 852e86bd0a23..0e58588fa066 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -19,11 +19,17 @@
 extern bool numa_demotion_enabled;
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
 #else
 static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif
 
 #else
@@ -33,5 +39,10 @@ static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 4715f9b96a44..4a96e4213d66 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -15,6 +15,7 @@ struct memory_tier {
 	struct list_head list;
 	int perf_level;
 	nodemask_t nodelist;
+	nodemask_t lower_tier_mask;
 };
 
 struct demotion_nodes {
@@ -153,6 +154,24 @@ static struct memory_tier *__node_get_memory_tier(int node)
 }
 
 #ifdef CONFIG_MIGRATION
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * Updates to pg_data_t.memtier include a synchronize_rcu(),
+	 * which ensures that we find either NULL or a valid memtier
+	 * in NODE_DATA. Protect the access via rcu_read_lock().
+	 */
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (memtier)
+		*targets = memtier->lower_tier_mask;
+	else
+		*targets = NODE_MASK_NONE;
+	rcu_read_unlock();
+}
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
@@ -201,10 +220,19 @@ int next_demotion_node(int node)
 /* Disable reclaim-based migration. */
 static void __disable_all_migrate_targets(void)
 {
+	struct memory_tier *memtier;
 	int node;
 
-	for_each_node_state(node, N_MEMORY)
+	for_each_node_state(node, N_MEMORY) {
 		node_demotion[node].preferred = NODE_MASK_NONE;
+		/*
+		 * We are holding memory_tier_lock, so it is safe
+		 * to access pgdat->memtier.
+		 */
+		memtier = rcu_dereference_check(NODE_DATA(node)->memtier,
+						lockdep_is_held(&memory_tier_lock));
+		memtier->lower_tier_mask = NODE_MASK_NONE;
+	}
 }
 
 static void disable_all_migrate_targets(void)
@@ -230,7 +258,7 @@ static void establish_migration_targets(void)
 	struct demotion_nodes *nd;
 	int target = NUMA_NO_NODE, node;
 	int distance, best_distance;
-	nodemask_t used;
+	nodemask_t used, lower_tier = NODE_MASK_NONE;
 
 	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
 		return;
@@ -276,6 +304,28 @@ static void establish_migration_targets(void)
 			}
 		} while (1);
 	}
+	/*
+	 * Now build the lower_tier mask for each node collecting node mask from
+	 * all memory tier below it. This allows us to fallback demotion page
+	 * allocation to a set of nodes that is closer to the above selected
+	 * preferred node.
+	 */
+	list_for_each_entry(memtier, &memory_tiers, list)
+		nodes_or(lower_tier, lower_tier, memtier->nodelist);
+	/*
+	 * Remove nodes not yet in N_MEMORY.
+	 */
+	nodes_and(lower_tier, node_states[N_MEMORY], lower_tier);
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		/*
+		 * Keep removing current tier from lower_tier nodes,
+		 * This will remove all nodes in current and above
+		 * memory tier from the lower_tier mask.
+		 */
+		nodes_andnot(lower_tier, lower_tier, memtier->nodelist);
+		memtier->lower_tier_mask = lower_tier;
+	}
 }
 #else
 static inline void disable_all_migrate_targets(void) {}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a8f78277f99..60a5235dd639 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1460,21 +1460,34 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-static struct page *alloc_demote_page(struct page *page, unsigned long node)
+static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
-	struct migration_target_control mtc = {
-		/*
-		 * Allocate from 'node', or fail quickly and quietly.
-		 * When this happens, 'page' will likely just be discarded
-		 * instead of migrated.
-		 */
-		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
-			    __GFP_THISNODE  | __GFP_NOWARN |
-			    __GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = node
-	};
+	struct page *target_page;
+	nodemask_t *allowed_mask;
+	struct migration_target_control *mtc;
+
+	mtc = (struct migration_target_control *)private;
+
+	allowed_mask = mtc->nmask;
+	/*
+	 * Make sure we allocate from the target node first, also trying to
+	 * reclaim pages from the target node via kswapd if we are low on
+	 * free memory on the target node. If we don't do this and we have low
+	 * free memory on the target memtier, we would start allocating pages
+	 * from higher memory tiers without even forcing a demotion of cold
+	 * pages from the target memtier. This can result in the kernel placing
+	 * hot pages in higher memory tiers.
+	 */
+	mtc->nmask = NULL;
+	mtc->gfp_mask |= __GFP_THISNODE;
+	target_page = alloc_migration_target(page, (unsigned long)mtc);
+	if (target_page)
+		return target_page;
 
-	return alloc_migration_target(page, (unsigned long)&mtc);
+	mtc->gfp_mask &= ~__GFP_THISNODE;
+	mtc->nmask = allowed_mask;
+
+	return alloc_migration_target(page, (unsigned long)mtc);
 }
 
 /*
@@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
+	nodemask_t allowed_mask;
+
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
+			__GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};
 
 	if (list_empty(demote_pages))
 		return 0;
@@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	if (target_nid == NUMA_NO_NODE)
 		return 0;
 
+	node_get_allowed_targets(pgdat, &allowed_mask);
+
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_pages, alloc_demote_page, NULL,
-			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
-			    &nr_succeeded);
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+		      &nr_succeeded);
 
 	if (current_is_kswapd())
 		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
-- 
2.36.1
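
The lower_tier_mask construction in establish_migration_targets() can
be modelled in user space with plain bitmasks. The following is a
minimal sketch, not the kernel code: struct tier_sketch and
build_lower_tier_masks() are hypothetical names, one bit stands for one
NUMA node, and the array is assumed to be ordered from the highest tier
to the lowest (how the kernel list itself is ordered is not modelled).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct memory_tier; one bit per NUMA node. */
struct tier_sketch {
	uint64_t nodelist;		/* nodes that belong to this tier */
	uint64_t lower_tier_mask;	/* computed: nodes in all tiers below */
};

/*
 * tiers[] must be ordered from the highest tier to the lowest tier.
 * has_memory models node_states[N_MEMORY].
 */
static void build_lower_tier_masks(struct tier_sketch *tiers, size_t nr,
				   uint64_t has_memory)
{
	uint64_t lower = 0;
	size_t i;

	/* Start with every node that belongs to any tier ... */
	for (i = 0; i < nr; i++)
		lower |= tiers[i].nodelist;
	/* ... and drop the nodes that have no memory yet. */
	lower &= has_memory;

	/* Peel off tiers top-down; what remains is strictly lower. */
	for (i = 0; i < nr; i++) {
		lower &= ~tiers[i].nodelist;
		tiers[i].lower_tier_mask = lower;
	}
}
```

For three tiers with node masks 0x3, 0xc and 0x30 and a memory mask of
0x1f, the resulting lower-tier masks are 0x1c, 0x10 and 0x0. The lowest
tier ends up with an empty mask, so it has no demotion fallback targets.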



* [PATCH v10 8/8] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-07-20  2:59 [PATCH v10 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (6 preceding siblings ...)
  2022-07-20  2:59 ` [PATCH v10 7/8] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
@ 2022-07-20  2:59 ` Aneesh Kumar K.V
  2022-07-25  8:54   ` Huang, Ying
  7 siblings, 1 reply; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-20  2:59 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

With memory tiers support, we can have memory-only NUMA nodes
in the top tier, for which we want to avoid promotion-tracking NUMA
faults. Update node_is_toptier() to work with memory tiers.
All NUMA nodes are top tier nodes by default. Once lower memory
tiers are added, we consider the memory tier that has CPU NUMA nodes
and all memory tiers above it as top memory tiers.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h | 11 +++++++++
 include/linux/node.h         |  5 -----
 mm/huge_memory.c             |  1 +
 mm/memory-tiers.c            | 43 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 |  1 +
 mm/mprotect.c                |  1 +
 6 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 0e58588fa066..085dd815bf73 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -20,6 +20,7 @@ extern bool numa_demotion_enabled;
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
+bool node_is_toptier(int node);
 #else
 static inline int next_demotion_node(int node)
 {
@@ -30,6 +31,11 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+	return true;
+}
 #endif
 
 #else
@@ -44,5 +50,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+	return true;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/node.h b/include/linux/node.h
index a2a16d4104fd..d0432db18094 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -191,9 +191,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
 
 #define to_node(device) container_of(device, struct node, dev)
 
-static inline bool node_is_toptier(int node)
-{
-	return node_state(node, N_CPU);
-}
-
 #endif /* _LINUX_NODE_H_ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 834f288b3769..8405662646e9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -35,6 +35,7 @@
 #include <linux/numa.h>
 #include <linux/page_owner.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 4a96e4213d66..f0515bfd4051 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -13,6 +13,7 @@
 
 struct memory_tier {
 	struct list_head list;
+	int id;
 	int perf_level;
 	nodemask_t nodelist;
 	nodemask_t lower_tier_mask;
@@ -26,6 +27,7 @@ static LIST_HEAD(memory_tiers);
 static DEFINE_MUTEX(memory_tier_lock);
 
 #ifdef CONFIG_MIGRATION
+static int top_tier_id;
 /*
  * node_demotion[] examples:
  *
@@ -129,6 +131,7 @@ static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
 	if (!new_memtier)
 		return ERR_PTR(-ENOMEM);
 
+	new_memtier->id = perf_level >> MEMTIER_CHUNK_BITS;
 	new_memtier->perf_level = perf_level;
 	if (found_slot)
 		list_add_tail(&new_memtier->list, ent);
@@ -154,6 +157,31 @@ static struct memory_tier *__node_get_memory_tier(int node)
 }
 
 #ifdef CONFIG_MIGRATION
+bool node_is_toptier(int node)
+{
+	bool toptier;
+	pg_data_t *pgdat;
+	struct memory_tier *memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return false;
+
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (!memtier) {
+		toptier = true;
+		goto out;
+	}
+	if (memtier->id >= top_tier_id)
+		toptier = true;
+	else
+		toptier = false;
+out:
+	rcu_read_unlock();
+	return toptier;
+}
+
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 {
 	struct memory_tier *memtier;
@@ -304,6 +332,21 @@ static void establish_migration_targets(void)
 			}
 		} while (1);
 	}
+	/*
+	 * Promotion is allowed from a memory tier to higher
+	 * memory tier only if the memory tier doesn't include
+	 * compute. We want to skip promotion from a memory tier,
+	 * if any node that is part of the memory tier has CPUs.
+	 * Once we detect such a memory tier, we consider that tier
+	 * as top tiper from which promotion is not allowed.
+	 */
+	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
+		nodes_and(used, node_states[N_CPU], memtier->nodelist);
+		if (!nodes_empty(used)) {
+			top_tier_id = memtier->id;
+			break;
+		}
+	}
 	/*
 	 * Now build the lower_tier mask for each node collecting node mask from
 	 * all memory tier below it. This allows us to fallback demotion page
diff --git a/mm/migrate.c b/mm/migrate.c
index c758c9c21d7d..1da81136eaaa 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@
 #include <linux/memory.h>
 #include <linux/random.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..92a2fc0fa88b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -31,6 +31,7 @@
 #include <linux/pgtable.h>
 #include <linux/sched/sysctl.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/memory-tiers.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
-- 
2.36.1
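
The top-tier decision in this patch can be illustrated with a small
user-space model. This is a sketch under assumptions, not the kernel
implementation: struct tier_id_sketch and both helpers are hypothetical
names, the array is assumed to be ordered from the lowest tier to the
highest, and the first tier containing a CPU node is taken as the
top-tier boundary. The fallback value 0 mirrors the zero-initialised
static top_tier_id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct memory_tier; one bit per NUMA node. */
struct tier_id_sketch {
	int id;
	uint64_t nodelist;
};

/*
 * tiers[] is ordered from the lowest tier to the highest tier.
 * cpu_nodes models node_states[N_CPU].
 */
static int find_top_tier_id(const struct tier_id_sketch *tiers, size_t nr,
			    uint64_t cpu_nodes)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		/* The first tier that has any CPU node marks the boundary. */
		if (tiers[i].nodelist & cpu_nodes)
			return tiers[i].id;
	}
	return 0;	/* no tier has CPUs: every tier counts as top tier */
}

/* A tier is "top" if it is at or above the boundary tier. */
static bool tier_is_toptier(int tier_id, int top_tier_id)
{
	return tier_id >= top_tier_id;
}
```

With a PMEM tier (id 1, node 2), a DRAM tier with CPUs (id 2, nodes 0
and 1) and an HBM tier (id 3, node 3), the boundary is tier 2: tiers 2
and 3 count as top tier, tier 1 does not.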



* Re: [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-20  2:59 ` [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
@ 2022-07-20  3:38   ` Aneesh Kumar K.V
  2022-07-21  0:02   ` kernel test robot
  2022-07-26  7:44   ` Huang, Ying
  2 siblings, 0 replies; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-20  3:38 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss


I missed folding a list walking fix. Use this diff on top
for testing.

diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index b2da34a1f06c..f3d720b7dc6c 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -300,12 +300,12 @@ static void establish_migration_targets(void)
 		nd = &node_demotion[node];
 
 		memtier = __node_get_memory_tier(node);
-		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
+		if (!memtier || list_is_first(&memtier->list, &memory_tiers))
 			continue;
 		/*
 		 * Get the next memtier to find the  demotion node list.
 		 */
-		memtier = list_next_entry(memtier, list);
+		memtier = list_prev_entry(memtier, list);
 
 		/*
 		 * find_next_best_node, use 'used' nodemask as a skip list.
@@ -342,7 +342,7 @@ static void establish_migration_targets(void)
 	 * Once we detect such a memory tier, we consider that tier
 	 * as top tiper from which promotion is not allowed.
 	 */
-	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
+	list_for_each_entry(memtier, &memory_tiers, list) {
 		nodes_and(used, node_states[N_CPU], memtier->nodelist);
 		if (!nodes_empty(used)) {
 			top_tier_id = memtier->id;
@@ -364,7 +364,7 @@ static void establish_migration_targets(void)
 	 */
 	nodes_and(lower_tier, node_states[N_MEMORY], lower_tier);
 
-	list_for_each_entry(memtier, &memory_tiers, list) {
+	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
 		/*
 		 * Keep removing current tier from lower_tier nodes,
 		 * This will remove all nodes in current and above



* Re: [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-20  2:59 ` [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
  2022-07-20  3:38   ` Aneesh Kumar K.V
@ 2022-07-21  0:02   ` kernel test robot
  2022-07-26  7:44   ` Huang, Ying
  2 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2022-07-21  0:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: kbuild-all, Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

Hi Aneesh,

I love your patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Aneesh-Kumar-K-V/mm-demotion-Memory-tiers-and-demotion/20220720-110356
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
config: arm64-randconfig-r025-20220718 (https://download.01.org/0day-ci/archive/20220721/202207210723.HeaXhM1j-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/b5fc1c467550e9f07b8f128b554a7d68e13628ff
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Aneesh-Kumar-K-V/mm-demotion-Memory-tiers-and-demotion/20220720-110356
        git checkout b5fc1c467550e9f07b8f128b554a7d68e13628ff
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/memory-tiers.c: In function 'migrate_on_reclaim_init':
>> mm/memory-tiers.c:349:17: error: 'node_demotion' undeclared (first use in this function)
     349 |                 node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
         |                 ^~~~~~~~~~~~~
   mm/memory-tiers.c:349:17: note: each undeclared identifier is reported only once for each function it appears in


vim +/node_demotion +349 mm/memory-tiers.c

   345	
   346	static void __init migrate_on_reclaim_init(void)
   347	{
   348		if (IS_ENABLED(CONFIG_MIGRATION)) {
 > 349			node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
   350						GFP_KERNEL);
   351			WARN_ON(!node_demotion);
   352		}
   353		hotplug_memory_notifier(migrate_on_reclaim_callback, MEMTIER_HOTPLUG_PRIO);
   354	}
   355	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp
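
An error of this kind is typical when a use is guarded only by a
constant condition such as IS_ENABLED() while the declaration
presumably sits inside an #ifdef block: the guarded branch is still
compiled, so every identifier in it must be declared. The user-space
sketch below shows one pattern that always compiles. FEATURE_MIGRATION,
node_demotion_sketch and migrate_init_sketch() are hypothetical
stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical feature switch standing in for IS_ENABLED(CONFIG_MIGRATION). */
#ifndef FEATURE_MIGRATION
#define FEATURE_MIGRATION 0
#endif

struct demotion_nodes_sketch {
	unsigned long preferred;
};

/* Declared unconditionally, so the reference below always compiles. */
static struct demotion_nodes_sketch *node_demotion_sketch;

static int migrate_init_sketch(size_t max_nodes)
{
	/*
	 * The compiler still parses this branch when the condition is
	 * constant false; it only drops the generated code afterwards.
	 */
	if (FEATURE_MIGRATION) {
		node_demotion_sketch = calloc(max_nodes,
					      sizeof(*node_demotion_sketch));
		if (!node_demotion_sketch)
			return -1;
	}
	return 0;
}
```

The alternative is to wrap both the declaration and every use in the
same #ifdef, at the cost of losing compile coverage for the disabled
configuration.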


* Re: [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM
  2022-07-20  2:59 ` [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM Aneesh Kumar K.V
@ 2022-07-21  6:07   ` kernel test robot
  2022-07-25  6:37   ` Huang, Ying
  1 sibling, 0 replies; 39+ messages in thread
From: kernel test robot @ 2022-07-21  6:07 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: llvm, kbuild-all, Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

Hi Aneesh,

I love your patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Aneesh-Kumar-K-V/mm-demotion-Memory-tiers-and-demotion/20220720-110356
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
config: x86_64-randconfig-r004-20220718 (https://download.01.org/0day-ci/archive/20220721/202207211403.1K7X9mSi-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project dd5635541cd7bbd62cd59b6694dfb759b6e9a0d8)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/d94a14b8fe93ff567d64b793ce1939698ca0b834
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Aneesh-Kumar-K-V/mm-demotion-Memory-tiers-and-demotion/20220720-110356
        git checkout d94a14b8fe93ff567d64b793ce1939698ca0b834
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/acpi/nfit/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/acpi/nfit/core.c:1719:13: warning: no previous prototype for function 'nfit_intel_shutdown_status' [-Wmissing-prototypes]
   __weak void nfit_intel_shutdown_status(struct nfit_mem *nfit_mem)
               ^
   drivers/acpi/nfit/core.c:1719:8: note: declare 'static' if the function is not intended to be used outside of this translation unit
   __weak void nfit_intel_shutdown_status(struct nfit_mem *nfit_mem)
          ^
          static 
>> drivers/acpi/nfit/core.c:3495:37: error: use of undeclared identifier 'MEMTIER_PERF_LEVEL_PMEM'
                                   node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
                                                                   ^
>> drivers/acpi/nfit/core.c:3551:41: error: use of undeclared identifier 'MEMTIER_HOTPLUG_PRIO'
           hotplug_memory_notifier(nfit_callback, MEMTIER_HOTPLUG_PRIO + 1);
                                                  ^
   1 warning and 2 errors generated.


vim +/MEMTIER_PERF_LEVEL_PMEM +3495 drivers/acpi/nfit/core.c

  3474	
  3475	static int nfit_callback(struct notifier_block *self,
  3476				 unsigned long action, void *arg)
  3477	{
  3478		bool found = false;
  3479		struct memory_notify *mnb = arg;
  3480		int nid = mnb->status_change_nid;
  3481		struct nfit_spa *nfit_spa;
  3482		struct acpi_nfit_desc *acpi_desc;
  3483	
  3484		if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
  3485			return NOTIFY_OK;
  3486	
  3487		mutex_lock(&acpi_desc_lock);
  3488		list_for_each_entry(acpi_desc, &acpi_descs, list) {
  3489			mutex_lock(&acpi_desc->init_mutex);
  3490			list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
  3491				struct acpi_nfit_system_address *spa = nfit_spa->spa;
  3492				int target_node = pxm_to_node(spa->proximity_domain);
  3493	
  3494				if (target_node == nid) {
> 3495					node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
  3496					found = true;
  3497					break;
  3498				}
  3499			}
  3500			mutex_unlock(&acpi_desc->init_mutex);
  3501			if (found)
  3502				break;
  3503		}
  3504		mutex_unlock(&acpi_desc_lock);
  3505		return NOTIFY_OK;
  3506	}
  3507	
  3508	static __init int nfit_init(void)
  3509	{
  3510		int ret;
  3511	
  3512		BUILD_BUG_ON(sizeof(struct acpi_table_nfit) != 40);
  3513		BUILD_BUG_ON(sizeof(struct acpi_nfit_system_address) != 64);
  3514		BUILD_BUG_ON(sizeof(struct acpi_nfit_memory_map) != 48);
  3515		BUILD_BUG_ON(sizeof(struct acpi_nfit_interleave) != 20);
  3516		BUILD_BUG_ON(sizeof(struct acpi_nfit_smbios) != 9);
  3517		BUILD_BUG_ON(sizeof(struct acpi_nfit_control_region) != 80);
  3518		BUILD_BUG_ON(sizeof(struct acpi_nfit_data_region) != 40);
  3519		BUILD_BUG_ON(sizeof(struct acpi_nfit_capabilities) != 16);
  3520	
  3521		guid_parse(UUID_VOLATILE_MEMORY, &nfit_uuid[NFIT_SPA_VOLATILE]);
  3522		guid_parse(UUID_PERSISTENT_MEMORY, &nfit_uuid[NFIT_SPA_PM]);
  3523		guid_parse(UUID_CONTROL_REGION, &nfit_uuid[NFIT_SPA_DCR]);
  3524		guid_parse(UUID_DATA_REGION, &nfit_uuid[NFIT_SPA_BDW]);
  3525		guid_parse(UUID_VOLATILE_VIRTUAL_DISK, &nfit_uuid[NFIT_SPA_VDISK]);
  3526		guid_parse(UUID_VOLATILE_VIRTUAL_CD, &nfit_uuid[NFIT_SPA_VCD]);
  3527		guid_parse(UUID_PERSISTENT_VIRTUAL_DISK, &nfit_uuid[NFIT_SPA_PDISK]);
  3528		guid_parse(UUID_PERSISTENT_VIRTUAL_CD, &nfit_uuid[NFIT_SPA_PCD]);
  3529		guid_parse(UUID_NFIT_BUS, &nfit_uuid[NFIT_DEV_BUS]);
  3530		guid_parse(UUID_NFIT_DIMM, &nfit_uuid[NFIT_DEV_DIMM]);
  3531		guid_parse(UUID_NFIT_DIMM_N_HPE1, &nfit_uuid[NFIT_DEV_DIMM_N_HPE1]);
  3532		guid_parse(UUID_NFIT_DIMM_N_HPE2, &nfit_uuid[NFIT_DEV_DIMM_N_HPE2]);
  3533		guid_parse(UUID_NFIT_DIMM_N_MSFT, &nfit_uuid[NFIT_DEV_DIMM_N_MSFT]);
  3534		guid_parse(UUID_NFIT_DIMM_N_HYPERV, &nfit_uuid[NFIT_DEV_DIMM_N_HYPERV]);
  3535		guid_parse(UUID_INTEL_BUS, &nfit_uuid[NFIT_BUS_INTEL]);
  3536	
  3537		nfit_wq = create_singlethread_workqueue("nfit");
  3538		if (!nfit_wq)
  3539			return -ENOMEM;
  3540	
  3541		nfit_mce_register();
  3542		ret = acpi_bus_register_driver(&acpi_nfit_driver);
  3543		if (ret) {
  3544			nfit_mce_unregister();
  3545			destroy_workqueue(nfit_wq);
  3546		}
  3547		/*
  3548		 * register a memory hotplug notifier at prio 2 so that we
  3549		 * can update the perf level for the node.
  3550		 */
> 3551		hotplug_memory_notifier(nfit_callback, MEMTIER_HOTPLUG_PRIO + 1);
  3552		return ret;
  3553	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


* Re: [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM
  2022-07-20  2:59 ` [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM Aneesh Kumar K.V
  2022-07-21  6:07   ` kernel test robot
@ 2022-07-25  6:37   ` Huang, Ying
  2022-07-25  6:48     ` Aneesh Kumar K V
  1 sibling, 1 reply; 39+ messages in thread
From: Huang, Ying @ 2022-07-25  6:37 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> By default, all nodes are assigned to the default memory tier which
> is the memory tier designated for nodes with DRAM
>
> Set dax kmem device node's tier to slower memory tier by assigning
> performance level to MEMTIER_PERF_LEVEL_PMEM. PMEM tier
> appears below the default memory tier in demotion order.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 41 ++++++++++++++++++++---
>  drivers/acpi/nfit/core.c                  | 41 ++++++++++++++++++++++-
>  2 files changed, 76 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index 82cae08976bc..3b6164418d6f 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -14,6 +14,8 @@
>  #include <linux/delay.h>
>  #include <linux/seq_buf.h>
>  #include <linux/nd.h>
> +#include <linux/memory.h>
> +#include <linux/memory-tiers.h>
>  
>  #include <asm/plpar_wrappers.h>
>  #include <asm/papr_pdsm.h>
> @@ -98,6 +100,7 @@ struct papr_scm_priv {
>  	bool hcall_flush_required;
>  
>  	uint64_t bound_addr;
> +	int target_node;
>  
>  	struct nvdimm_bus_descriptor bus_desc;
>  	struct nvdimm_bus *bus;
> @@ -1278,6 +1281,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>  	p->bus_desc.module = THIS_MODULE;
>  	p->bus_desc.of_node = p->pdev->dev.of_node;
>  	p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
> +	p->target_node = dev_to_node(&p->pdev->dev);
>  
>  	/* Set the dimm command family mask to accept PDSMs */
>  	set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask);
> @@ -1322,7 +1326,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>  	mapping.size = p->blocks * p->block_size; // XXX: potential overflow?
>  
>  	memset(&ndr_desc, 0, sizeof(ndr_desc));
> -	target_nid = dev_to_node(&p->pdev->dev);
> +	target_nid = p->target_node;
>  	online_nid = numa_map_to_online_node(target_nid);
>  	ndr_desc.numa_node = online_nid;
>  	ndr_desc.target_node = target_nid;
> @@ -1582,15 +1586,42 @@ static struct platform_driver papr_scm_driver = {
>  	},
>  };
>  
> +static int papr_scm_callback(struct notifier_block *self,
> +			     unsigned long action, void *arg)
> +{
> +	struct memory_notify *mnb = arg;
> +	int nid = mnb->status_change_nid;
> +	struct papr_scm_priv *p;
> +
> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
> +		return NOTIFY_OK;
> +
> +	mutex_lock(&papr_ndr_lock);
> +	list_for_each_entry(p, &papr_nd_regions, region_list) {
> +		if (p->target_node == nid) {
> +			node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
> +			break;
> +		}
> +	}
> +
> +	mutex_unlock(&papr_ndr_lock);
> +	return NOTIFY_OK;
> +}
> +
>  static int __init papr_scm_init(void)
>  {
>  	int ret;
>  
>  	ret = platform_driver_register(&papr_scm_driver);
> -	if (!ret)
> -		mce_register_notifier(&mce_ue_nb);
> -
> -	return ret;
> +	if (ret)
> +		return ret;
> +	mce_register_notifier(&mce_ue_nb);
> +	/*
> +	 * register a memory hotplug notifier at prio 2 so that we
> +	 * can update the perf level for the node.
> +	 */
> +	hotplug_memory_notifier(papr_scm_callback, MEMTIER_HOTPLUG_PRIO + 1);
> +	return 0;
>  }
>  module_init(papr_scm_init);
>  
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index ae5f4acf2675..7ea1017ef790 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -15,6 +15,8 @@
>  #include <linux/sort.h>
>  #include <linux/io.h>
>  #include <linux/nd.h>
> +#include <linux/memory.h>
> +#include <linux/memory-tiers.h>
>  #include <asm/cacheflush.h>
>  #include <acpi/nfit.h>
>  #include "intel.h"
> @@ -3470,6 +3472,39 @@ static struct acpi_driver acpi_nfit_driver = {
>  	},
>  };
>  
> +static int nfit_callback(struct notifier_block *self,
> +			 unsigned long action, void *arg)
> +{
> +	bool found = false;
> +	struct memory_notify *mnb = arg;
> +	int nid = mnb->status_change_nid;
> +	struct nfit_spa *nfit_spa;
> +	struct acpi_nfit_desc *acpi_desc;
> +
> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
> +		return NOTIFY_OK;
> +
> +	mutex_lock(&acpi_desc_lock);
> +	list_for_each_entry(acpi_desc, &acpi_descs, list) {
> +		mutex_lock(&acpi_desc->init_mutex);
> +		list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
> +			struct acpi_nfit_system_address *spa = nfit_spa->spa;
> +			int target_node = pxm_to_node(spa->proximity_domain);
> +
> +			if (target_node == nid) {
> +				node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
> +				found = true;
> +				break;
> +			}
> +		}
> +		mutex_unlock(&acpi_desc->init_mutex);
> +		if (found)
> +			break;
> +	}
> +	mutex_unlock(&acpi_desc_lock);
> +	return NOTIFY_OK;
> +}
> +
>  static __init int nfit_init(void)
>  {
>  	int ret;
> @@ -3509,7 +3544,11 @@ static __init int nfit_init(void)
>  		nfit_mce_unregister();
>  		destroy_workqueue(nfit_wq);
>  	}
> -
> +	/*
> +	 * register a memory hotplug notifier at prio 2 so that we
> +	 * can update the perf level for the node.
> +	 */
> +	hotplug_memory_notifier(nfit_callback, MEMTIER_HOTPLUG_PRIO + 1);
>  	return ret;
>  
>  }

I don't think it's a good idea to set the perf_level of a memory device
(node) via NFIT only.

For example, we may prefer HMAT over NFIT when it's available.  So the
perf_level should be set in dax/kmem.c based on information provided by
ACPI or other information sources.  ACPI can provide some functions/data
structures to let drivers (like dax/kmem.c) query the properties of
the memory device (node).

As the simplest first version, this can just be hard-coded.

Best Regards,
Huang, Ying

* Re: [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM
  2022-07-25  6:37   ` Huang, Ying
@ 2022-07-25  6:48     ` Aneesh Kumar K V
  2022-07-25  8:35       ` Huang, Ying
  0 siblings, 1 reply; 39+ messages in thread
From: Aneesh Kumar K V @ 2022-07-25  6:48 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

On 7/25/22 12:07 PM, Huang, Ying wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> 
>> By default, all nodes are assigned to the default memory tier which
>> is the memory tier designated for nodes with DRAM
>>
>> Set dax kmem device node's tier to slower memory tier by assigning
>> performance level to MEMTIER_PERF_LEVEL_PMEM. PMEM tier
>> appears below the default memory tier in demotion order.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  arch/powerpc/platforms/pseries/papr_scm.c | 41 ++++++++++++++++++++---
>>  drivers/acpi/nfit/core.c                  | 41 ++++++++++++++++++++++-
>>  2 files changed, 76 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
>> index 82cae08976bc..3b6164418d6f 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -14,6 +14,8 @@
>>  #include <linux/delay.h>
>>  #include <linux/seq_buf.h>
>>  #include <linux/nd.h>
>> +#include <linux/memory.h>
>> +#include <linux/memory-tiers.h>
>>  
>>  #include <asm/plpar_wrappers.h>
>>  #include <asm/papr_pdsm.h>
>> @@ -98,6 +100,7 @@ struct papr_scm_priv {
>>  	bool hcall_flush_required;
>>  
>>  	uint64_t bound_addr;
>> +	int target_node;
>>  
>>  	struct nvdimm_bus_descriptor bus_desc;
>>  	struct nvdimm_bus *bus;
>> @@ -1278,6 +1281,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>  	p->bus_desc.module = THIS_MODULE;
>>  	p->bus_desc.of_node = p->pdev->dev.of_node;
>>  	p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
>> +	p->target_node = dev_to_node(&p->pdev->dev);
>>  
>>  	/* Set the dimm command family mask to accept PDSMs */
>>  	set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask);
>> @@ -1322,7 +1326,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>  	mapping.size = p->blocks * p->block_size; // XXX: potential overflow?
>>  
>>  	memset(&ndr_desc, 0, sizeof(ndr_desc));
>> -	target_nid = dev_to_node(&p->pdev->dev);
>> +	target_nid = p->target_node;
>>  	online_nid = numa_map_to_online_node(target_nid);
>>  	ndr_desc.numa_node = online_nid;
>>  	ndr_desc.target_node = target_nid;
>> @@ -1582,15 +1586,42 @@ static struct platform_driver papr_scm_driver = {
>>  	},
>>  };
>>  
>> +static int papr_scm_callback(struct notifier_block *self,
>> +			     unsigned long action, void *arg)
>> +{
>> +	struct memory_notify *mnb = arg;
>> +	int nid = mnb->status_change_nid;
>> +	struct papr_scm_priv *p;
>> +
>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>> +		return NOTIFY_OK;
>> +
>> +	mutex_lock(&papr_ndr_lock);
>> +	list_for_each_entry(p, &papr_nd_regions, region_list) {
>> +		if (p->target_node == nid) {
>> +			node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>> +			break;
>> +		}
>> +	}
>> +
>> +	mutex_unlock(&papr_ndr_lock);
>> +	return NOTIFY_OK;
>> +}
>> +
>>  static int __init papr_scm_init(void)
>>  {
>>  	int ret;
>>  
>>  	ret = platform_driver_register(&papr_scm_driver);
>> -	if (!ret)
>> -		mce_register_notifier(&mce_ue_nb);
>> -
>> -	return ret;
>> +	if (ret)
>> +		return ret;
>> +	mce_register_notifier(&mce_ue_nb);
>> +	/*
>> +	 * register a memory hotplug notifier at prio 2 so that we
>> +	 * can update the perf level for the node.
>> +	 */
>> +	hotplug_memory_notifier(papr_scm_callback, MEMTIER_HOTPLUG_PRIO + 1);
>> +	return 0;
>>  }
>>  module_init(papr_scm_init);
>>  
>> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
>> index ae5f4acf2675..7ea1017ef790 100644
>> --- a/drivers/acpi/nfit/core.c
>> +++ b/drivers/acpi/nfit/core.c
>> @@ -15,6 +15,8 @@
>>  #include <linux/sort.h>
>>  #include <linux/io.h>
>>  #include <linux/nd.h>
>> +#include <linux/memory.h>
>> +#include <linux/memory-tiers.h>
>>  #include <asm/cacheflush.h>
>>  #include <acpi/nfit.h>
>>  #include "intel.h"
>> @@ -3470,6 +3472,39 @@ static struct acpi_driver acpi_nfit_driver = {
>>  	},
>>  };
>>  
>> +static int nfit_callback(struct notifier_block *self,
>> +			 unsigned long action, void *arg)
>> +{
>> +	bool found = false;
>> +	struct memory_notify *mnb = arg;
>> +	int nid = mnb->status_change_nid;
>> +	struct nfit_spa *nfit_spa;
>> +	struct acpi_nfit_desc *acpi_desc;
>> +
>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>> +		return NOTIFY_OK;
>> +
>> +	mutex_lock(&acpi_desc_lock);
>> +	list_for_each_entry(acpi_desc, &acpi_descs, list) {
>> +		mutex_lock(&acpi_desc->init_mutex);
>> +		list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
>> +			struct acpi_nfit_system_address *spa = nfit_spa->spa;
>> +			int target_node = pxm_to_node(spa->proximity_domain);
>> +
>> +			if (target_node == nid) {
>> +				node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>> +				found = true;
>> +				break;
>> +			}
>> +		}
>> +		mutex_unlock(&acpi_desc->init_mutex);
>> +		if (found)
>> +			break;
>> +	}
>> +	mutex_unlock(&acpi_desc_lock);
>> +	return NOTIFY_OK;
>> +}
>> +
>>  static __init int nfit_init(void)
>>  {
>>  	int ret;
>> @@ -3509,7 +3544,11 @@ static __init int nfit_init(void)
>>  		nfit_mce_unregister();
>>  		destroy_workqueue(nfit_wq);
>>  	}
>> -
>> +	/*
>> +	 * register a memory hotplug notifier at prio 2 so that we
>> +	 * can update the perf level for the node.
>> +	 */
>> +	hotplug_memory_notifier(nfit_callback, MEMTIER_HOTPLUG_PRIO + 1);
>>  	return ret;
>>  
>>  }
> 
> I don't think that it's a good idea to set perf_level of a memory device
> (node) via NFIT only.

> 
> For example, we may prefer HMAT over NFIT when it's available.  So the
> perf_level should be set in dax/kmem.c based on information provided by
> ACPI or other information sources.  ACPI can provide some functions/data
> structures to let drivers (like dax/kmem.c) to query the properties of
> the memory device (node).
> 

I was trying to make it architecture-specific so that we have a placeholder
to fine-tune this better. For example, ppc64 will look at device tree
details to find the performance level and x86 will look at ACPI data structures.
Wouldn't adding that hotplug callback in dax/kmem prevent that
architecture-specific customization?

I would expect that callback to move to the generic ACPI layer so that even
firmware-managed CXL devices can be added to a lower tier.  I don't understand
ACPI well enough to find the right abstraction for that hotplug callback.


> As the simplest first version, this can be just hard coded.
> 

If you are suggesting not using the hotplug callback, one of the challenges is that
node_devices[nid] gets allocated pretty late, when we try to online the node.

> Best Regards,
> Huang, Ying


* Re: [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM
  2022-07-25  6:48     ` Aneesh Kumar K V
@ 2022-07-25  8:35       ` Huang, Ying
  2022-07-25  8:42         ` Aneesh Kumar K V
  0 siblings, 1 reply; 39+ messages in thread
From: Huang, Ying @ 2022-07-25  8:35 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 7/25/22 12:07 PM, Huang, Ying wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> By default, all nodes are assigned to the default memory tier which
>>> is the memory tier designated for nodes with DRAM
>>>
>>> Set dax kmem device node's tier to slower memory tier by assigning
>>> performance level to MEMTIER_PERF_LEVEL_PMEM. PMEM tier
>>> appears below the default memory tier in demotion order.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>>  arch/powerpc/platforms/pseries/papr_scm.c | 41 ++++++++++++++++++++---
>>>  drivers/acpi/nfit/core.c                  | 41 ++++++++++++++++++++++-
>>>  2 files changed, 76 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
>>> index 82cae08976bc..3b6164418d6f 100644
>>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>>> @@ -14,6 +14,8 @@
>>>  #include <linux/delay.h>
>>>  #include <linux/seq_buf.h>
>>>  #include <linux/nd.h>
>>> +#include <linux/memory.h>
>>> +#include <linux/memory-tiers.h>
>>>  
>>>  #include <asm/plpar_wrappers.h>
>>>  #include <asm/papr_pdsm.h>
>>> @@ -98,6 +100,7 @@ struct papr_scm_priv {
>>>  	bool hcall_flush_required;
>>>  
>>>  	uint64_t bound_addr;
>>> +	int target_node;
>>>  
>>>  	struct nvdimm_bus_descriptor bus_desc;
>>>  	struct nvdimm_bus *bus;
>>> @@ -1278,6 +1281,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>>  	p->bus_desc.module = THIS_MODULE;
>>>  	p->bus_desc.of_node = p->pdev->dev.of_node;
>>>  	p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
>>> +	p->target_node = dev_to_node(&p->pdev->dev);
>>>  
>>>  	/* Set the dimm command family mask to accept PDSMs */
>>>  	set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask);
>>> @@ -1322,7 +1326,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>>  	mapping.size = p->blocks * p->block_size; // XXX: potential overflow?
>>>  
>>>  	memset(&ndr_desc, 0, sizeof(ndr_desc));
>>> -	target_nid = dev_to_node(&p->pdev->dev);
>>> +	target_nid = p->target_node;
>>>  	online_nid = numa_map_to_online_node(target_nid);
>>>  	ndr_desc.numa_node = online_nid;
>>>  	ndr_desc.target_node = target_nid;
>>> @@ -1582,15 +1586,42 @@ static struct platform_driver papr_scm_driver = {
>>>  	},
>>>  };
>>>  
>>> +static int papr_scm_callback(struct notifier_block *self,
>>> +			     unsigned long action, void *arg)
>>> +{
>>> +	struct memory_notify *mnb = arg;
>>> +	int nid = mnb->status_change_nid;
>>> +	struct papr_scm_priv *p;
>>> +
>>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>>> +		return NOTIFY_OK;
>>> +
>>> +	mutex_lock(&papr_ndr_lock);
>>> +	list_for_each_entry(p, &papr_nd_regions, region_list) {
>>> +		if (p->target_node == nid) {
>>> +			node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>>> +			break;
>>> +		}
>>> +	}
>>> +
>>> +	mutex_unlock(&papr_ndr_lock);
>>> +	return NOTIFY_OK;
>>> +}
>>> +
>>>  static int __init papr_scm_init(void)
>>>  {
>>>  	int ret;
>>>  
>>>  	ret = platform_driver_register(&papr_scm_driver);
>>> -	if (!ret)
>>> -		mce_register_notifier(&mce_ue_nb);
>>> -
>>> -	return ret;
>>> +	if (ret)
>>> +		return ret;
>>> +	mce_register_notifier(&mce_ue_nb);
>>> +	/*
>>> +	 * register a memory hotplug notifier at prio 2 so that we
>>> +	 * can update the perf level for the node.
>>> +	 */
>>> +	hotplug_memory_notifier(papr_scm_callback, MEMTIER_HOTPLUG_PRIO + 1);
>>> +	return 0;
>>>  }
>>>  module_init(papr_scm_init);
>>>  
>>> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
>>> index ae5f4acf2675..7ea1017ef790 100644
>>> --- a/drivers/acpi/nfit/core.c
>>> +++ b/drivers/acpi/nfit/core.c
>>> @@ -15,6 +15,8 @@
>>>  #include <linux/sort.h>
>>>  #include <linux/io.h>
>>>  #include <linux/nd.h>
>>> +#include <linux/memory.h>
>>> +#include <linux/memory-tiers.h>
>>>  #include <asm/cacheflush.h>
>>>  #include <acpi/nfit.h>
>>>  #include "intel.h"
>>> @@ -3470,6 +3472,39 @@ static struct acpi_driver acpi_nfit_driver = {
>>>  	},
>>>  };
>>>  
>>> +static int nfit_callback(struct notifier_block *self,
>>> +			 unsigned long action, void *arg)
>>> +{
>>> +	bool found = false;
>>> +	struct memory_notify *mnb = arg;
>>> +	int nid = mnb->status_change_nid;
>>> +	struct nfit_spa *nfit_spa;
>>> +	struct acpi_nfit_desc *acpi_desc;
>>> +
>>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>>> +		return NOTIFY_OK;
>>> +
>>> +	mutex_lock(&acpi_desc_lock);
>>> +	list_for_each_entry(acpi_desc, &acpi_descs, list) {
>>> +		mutex_lock(&acpi_desc->init_mutex);
>>> +		list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
>>> +			struct acpi_nfit_system_address *spa = nfit_spa->spa;
>>> +			int target_node = pxm_to_node(spa->proximity_domain);
>>> +
>>> +			if (target_node == nid) {
>>> +				node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>>> +				found = true;
>>> +				break;
>>> +			}
>>> +		}
>>> +		mutex_unlock(&acpi_desc->init_mutex);
>>> +		if (found)
>>> +			break;
>>> +	}
>>> +	mutex_unlock(&acpi_desc_lock);
>>> +	return NOTIFY_OK;
>>> +}
>>> +
>>>  static __init int nfit_init(void)
>>>  {
>>>  	int ret;
>>> @@ -3509,7 +3544,11 @@ static __init int nfit_init(void)
>>>  		nfit_mce_unregister();
>>>  		destroy_workqueue(nfit_wq);
>>>  	}
>>> -
>>> +	/*
>>> +	 * register a memory hotplug notifier at prio 2 so that we
>>> +	 * can update the perf level for the node.
>>> +	 */
>>> +	hotplug_memory_notifier(nfit_callback, MEMTIER_HOTPLUG_PRIO + 1);
>>>  	return ret;
>>>  
>>>  }
>> 
>> I don't think that it's a good idea to set perf_level of a memory device
>> (node) via NFIT only.
>
>> 
>> For example, we may prefer HMAT over NFIT when it's available.  So the
>> perf_level should be set in dax/kmem.c based on information provided by
>> ACPI or other information sources.  ACPI can provide some functions/data
>> structures to let drivers (like dax/kmem.c) to query the properties of
>> the memory device (node).
>> 
>
> I was trying to make it architecture specific so that we have a placeholder
> to fine-tune this better. For example, ppc64 will look at device tree
> details to find the performance level and x86 will look at ACPI data structure.
> Adding that hotplug callback in dax/kmem will prevent that architecture-specific
> customization? 
>
> I would expect that callback to move to the generic ACPI layer so that even
> firmware managed CXL devices can be added to a lower tier?  I don't understand
> ACPI enough to find the right abstraction for that hotplug callback. 

I'm OK for this to be architecture specific.

But ACPI NFIT isn't enough for x86.  For example, PMEM can be added to a
virtual machine as normal memory nodes without NFIT.  Instead, PMEM is
marked via "memmap=<nn>G!<ss>G" or "efi_fake_mem=<nn>G@<ss>G:0x40000",
and dax/kmem.c is used to hot-add the memory.

So, until a more sophisticated version is implemented for x86, the
simplest version I suggest below works even better.

>> As the simplest first version, this can be just hard coded.
>> 
>
> If you are suggesting to not use hotplug callback, one of the challenge was node_devices[nid]
> get allocated pretty late when we try to online the node. 

As the simplest first version, this can be as simple as,

/* dax/kmem.c */
static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
	node_devices[dev_dax->target_node]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
	/* add_memory_driver_managed() */
}

To be compatible with the ppc64 version, how about making dev_dax_kmem_probe()
set perf_level only if it's uninitialized?
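
The "only if it's uninitialized" idea can be sketched in userspace C as
follows.  This is a minimal model, not kernel code: struct node_model,
MEMTIER_PERF_LEVEL_UNSET and kmem_set_default_perf_level() are hypothetical
names, and the numeric levels are illustrative values only.  It assumes some
sentinel exists that means "nobody has set a level yet".

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel definitions discussed here. */
#define MEMTIER_PERF_LEVEL_UNSET 0   /* assumed sentinel: no level chosen yet */
#define MEMTIER_PERF_LEVEL_DRAM  200 /* illustrative value only */
#define MEMTIER_PERF_LEVEL_PMEM  100 /* illustrative value only */

/* Hypothetical model of the per-node structure holding perf_level. */
struct node_model {
	int perf_level;
};

/*
 * Apply the PMEM default only when an earlier, architecture-specific step
 * has not already chosen a level.  Returns the level in effect afterwards,
 * or -1 when the node structure does not exist yet.
 */
static int kmem_set_default_perf_level(struct node_model *node)
{
	if (node == NULL)
		return -1;
	if (node->perf_level == MEMTIER_PERF_LEVEL_UNSET)
		node->perf_level = MEMTIER_PERF_LEVEL_PMEM;
	return node->perf_level;
}
```

With this shape, a level chosen earlier by architecture code (for example by
a ppc64 callback) is left alone, and the hard-coded PMEM default applies only
to nodes that nobody has classified.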

Best Regards,
Huang, Ying

* Re: [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM
  2022-07-25  8:35       ` Huang, Ying
@ 2022-07-25  8:42         ` Aneesh Kumar K V
  2022-07-26  2:13           ` Huang, Ying
  0 siblings, 1 reply; 39+ messages in thread
From: Aneesh Kumar K V @ 2022-07-25  8:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

On 7/25/22 2:05 PM, Huang, Ying wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> 
>> On 7/25/22 12:07 PM, Huang, Ying wrote:
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>
>>>> By default, all nodes are assigned to the default memory tier which
>>>> is the memory tier designated for nodes with DRAM
>>>>
>>>> Set dax kmem device node's tier to slower memory tier by assigning
>>>> performance level to MEMTIER_PERF_LEVEL_PMEM. PMEM tier
>>>> appears below the default memory tier in demotion order.
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>> ---
>>>>  arch/powerpc/platforms/pseries/papr_scm.c | 41 ++++++++++++++++++++---
>>>>  drivers/acpi/nfit/core.c                  | 41 ++++++++++++++++++++++-
>>>>  2 files changed, 76 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
>>>> index 82cae08976bc..3b6164418d6f 100644
>>>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>>>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>>>> @@ -14,6 +14,8 @@
>>>>  #include <linux/delay.h>
>>>>  #include <linux/seq_buf.h>
>>>>  #include <linux/nd.h>
>>>> +#include <linux/memory.h>
>>>> +#include <linux/memory-tiers.h>
>>>>  
>>>>  #include <asm/plpar_wrappers.h>
>>>>  #include <asm/papr_pdsm.h>
>>>> @@ -98,6 +100,7 @@ struct papr_scm_priv {
>>>>  	bool hcall_flush_required;
>>>>  
>>>>  	uint64_t bound_addr;
>>>> +	int target_node;
>>>>  
>>>>  	struct nvdimm_bus_descriptor bus_desc;
>>>>  	struct nvdimm_bus *bus;
>>>> @@ -1278,6 +1281,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>>>  	p->bus_desc.module = THIS_MODULE;
>>>>  	p->bus_desc.of_node = p->pdev->dev.of_node;
>>>>  	p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
>>>> +	p->target_node = dev_to_node(&p->pdev->dev);
>>>>  
>>>>  	/* Set the dimm command family mask to accept PDSMs */
>>>>  	set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask);
>>>> @@ -1322,7 +1326,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>>>  	mapping.size = p->blocks * p->block_size; // XXX: potential overflow?
>>>>  
>>>>  	memset(&ndr_desc, 0, sizeof(ndr_desc));
>>>> -	target_nid = dev_to_node(&p->pdev->dev);
>>>> +	target_nid = p->target_node;
>>>>  	online_nid = numa_map_to_online_node(target_nid);
>>>>  	ndr_desc.numa_node = online_nid;
>>>>  	ndr_desc.target_node = target_nid;
>>>> @@ -1582,15 +1586,42 @@ static struct platform_driver papr_scm_driver = {
>>>>  	},
>>>>  };
>>>>  
>>>> +static int papr_scm_callback(struct notifier_block *self,
>>>> +			     unsigned long action, void *arg)
>>>> +{
>>>> +	struct memory_notify *mnb = arg;
>>>> +	int nid = mnb->status_change_nid;
>>>> +	struct papr_scm_priv *p;
>>>> +
>>>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>>>> +		return NOTIFY_OK;
>>>> +
>>>> +	mutex_lock(&papr_ndr_lock);
>>>> +	list_for_each_entry(p, &papr_nd_regions, region_list) {
>>>> +		if (p->target_node == nid) {
>>>> +			node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>>>> +			break;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	mutex_unlock(&papr_ndr_lock);
>>>> +	return NOTIFY_OK;
>>>> +}
>>>> +
>>>>  static int __init papr_scm_init(void)
>>>>  {
>>>>  	int ret;
>>>>  
>>>>  	ret = platform_driver_register(&papr_scm_driver);
>>>> -	if (!ret)
>>>> -		mce_register_notifier(&mce_ue_nb);
>>>> -
>>>> -	return ret;
>>>> +	if (ret)
>>>> +		return ret;
>>>> +	mce_register_notifier(&mce_ue_nb);
>>>> +	/*
>>>> +	 * register a memory hotplug notifier at prio 2 so that we
>>>> +	 * can update the perf level for the node.
>>>> +	 */
>>>> +	hotplug_memory_notifier(papr_scm_callback, MEMTIER_HOTPLUG_PRIO + 1);
>>>> +	return 0;
>>>>  }
>>>>  module_init(papr_scm_init);
>>>>  
>>>> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
>>>> index ae5f4acf2675..7ea1017ef790 100644
>>>> --- a/drivers/acpi/nfit/core.c
>>>> +++ b/drivers/acpi/nfit/core.c
>>>> @@ -15,6 +15,8 @@
>>>>  #include <linux/sort.h>
>>>>  #include <linux/io.h>
>>>>  #include <linux/nd.h>
>>>> +#include <linux/memory.h>
>>>> +#include <linux/memory-tiers.h>
>>>>  #include <asm/cacheflush.h>
>>>>  #include <acpi/nfit.h>
>>>>  #include "intel.h"
>>>> @@ -3470,6 +3472,39 @@ static struct acpi_driver acpi_nfit_driver = {
>>>>  	},
>>>>  };
>>>>  
>>>> +static int nfit_callback(struct notifier_block *self,
>>>> +			 unsigned long action, void *arg)
>>>> +{
>>>> +	bool found = false;
>>>> +	struct memory_notify *mnb = arg;
>>>> +	int nid = mnb->status_change_nid;
>>>> +	struct nfit_spa *nfit_spa;
>>>> +	struct acpi_nfit_desc *acpi_desc;
>>>> +
>>>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>>>> +		return NOTIFY_OK;
>>>> +
>>>> +	mutex_lock(&acpi_desc_lock);
>>>> +	list_for_each_entry(acpi_desc, &acpi_descs, list) {
>>>> +		mutex_lock(&acpi_desc->init_mutex);
>>>> +		list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
>>>> +			struct acpi_nfit_system_address *spa = nfit_spa->spa;
>>>> +			int target_node = pxm_to_node(spa->proximity_domain);
>>>> +
>>>> +			if (target_node == nid) {
>>>> +				node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>>>> +				found = true;
>>>> +				break;
>>>> +			}
>>>> +		}
>>>> +		mutex_unlock(&acpi_desc->init_mutex);
>>>> +		if (found)
>>>> +			break;
>>>> +	}
>>>> +	mutex_unlock(&acpi_desc_lock);
>>>> +	return NOTIFY_OK;
>>>> +}
>>>> +
>>>>  static __init int nfit_init(void)
>>>>  {
>>>>  	int ret;
>>>> @@ -3509,7 +3544,11 @@ static __init int nfit_init(void)
>>>>  		nfit_mce_unregister();
>>>>  		destroy_workqueue(nfit_wq);
>>>>  	}
>>>> -
>>>> +	/*
>>>> +	 * register a memory hotplug notifier at prio 2 so that we
>>>> +	 * can update the perf level for the node.
>>>> +	 */
>>>> +	hotplug_memory_notifier(nfit_callback, MEMTIER_HOTPLUG_PRIO + 1);
>>>>  	return ret;
>>>>  
>>>>  }
>>>
>>> I don't think that it's a good idea to set perf_level of a memory device
>>> (node) via NFIT only.
>>
>>>
>>> For example, we may prefer HMAT over NFIT when it's available.  So the
>>> perf_level should be set in dax/kmem.c based on information provided by
>>> ACPI or other information sources.  ACPI can provide some functions/data
>>> structures to let drivers (like dax/kmem.c) to query the properties of
>>> the memory device (node).
>>>
>>
>> I was trying to make it architecture specific so that we have a placeholder
>> to fine-tune this better. For example, ppc64 will look at device tree
>> details to find the performance level and x86 will look at ACPI data structure.
>> Adding that hotplug callback in dax/kmem will prevent that architecture-specific
>> customization? 
>>
>> I would expect that callback to move to the generic ACPI layer so that even
>> firmware managed CXL devices can be added to a lower tier?  I don't understand
>> ACPI enough to find the right abstraction for that hotplug callback. 
> 
> I'm OK for this to be architecture specific.
> 
> But ACPI NFIT isn't enough for x86.  For example, PMEM can be added to a
> virtual machine as normal memory nodes without NFIT.  Instead, PMEM is
> marked via "memmap=<nn>G!<ss>G" or "efi_fake_mem=<nn>G@<ss>G:0x40000",
> and dax/kmem.c is used to hot-add the memory.
> 
> So, before a more sophisticated version is implemented for x86.  The
> simplest version as I suggested below works even better.
> 
>>> As the simplest first version, this can be just hard coded.
>>>
>>
>> If you are suggesting to not use hotplug callback, one of the challenge was node_devices[nid]
>> get allocated pretty late when we try to online the node. 
> 
> As the simplest first version, this can be as simple as,
> 
> /* dax/kmem.c */
> static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> {
> 	node_devices[dev_dax->target_node]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
> 	/* add_memory_driver_managed() */
> }
> 
> To be compatible with ppc64 version, how about make dev_dax_kmem_probe()
> set perf_level only if it's uninitialized?

That will result in a kernel crash because node_devices[dev_dax->target_node] is not initialized there.

It gets allocated in add_memory_resource -> __try_online_node -> register_one_node -> __register_one_node -> node_devices[nid] = kzalloc(sizeof(struct node), GFP_KERNEL);
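
The ordering problem can be modelled in userspace C as below.  This is only
a sketch of the situation described above, not the kernel implementation:
node_devices_model[], register_one_node_model() and set_perf_level_model()
are hypothetical names, and the array size and level value are illustrative.

```c
#include <stdlib.h>

#define MAX_NODES_MODEL 4            /* hypothetical, for this sketch only */
#define MEMTIER_PERF_LEVEL_PMEM 100  /* illustrative value only */

/* Hypothetical model of one entry in node_devices[]. */
struct node_entry {
	int perf_level;
};

/* Models node_devices[]: every slot is NULL until the node is registered. */
static struct node_entry *node_devices_model[MAX_NODES_MODEL];

/* Models the allocation done at the end of the call chain above. */
static int register_one_node_model(int nid)
{
	if (nid < 0 || nid >= MAX_NODES_MODEL)
		return -1;
	if (node_devices_model[nid] == NULL)
		node_devices_model[nid] = calloc(1, sizeof(struct node_entry));
	return node_devices_model[nid] ? 0 : -1;
}

/* Returns 0 on success, -1 if the node has not been registered yet. */
static int set_perf_level_model(int nid, int perf_level)
{
	if (nid < 0 || nid >= MAX_NODES_MODEL || node_devices_model[nid] == NULL)
		return -1;
	node_devices_model[nid]->perf_level = perf_level;
	return 0;
}
```

Writing the level before registration fails in this model, where the
unguarded kernel snippet would dereference a NULL pointer.  After
registration the same call succeeds, which is the ordering the hotplug
callback relies on.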

-aneesh

* Re: [PATCH v10 8/8] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-07-20  2:59 ` [PATCH v10 8/8] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
@ 2022-07-25  8:54   ` Huang, Ying
  2022-07-25  8:56     ` Aneesh Kumar K V
  0 siblings, 1 reply; 39+ messages in thread
From: Huang, Ying @ 2022-07-25  8:54 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> With memory tiers support we can have memory only NUMA nodes
> in the top tier from which we want to avoid promotion tracking NUMA
> faults. Update node_is_toptier to work with memory tiers.
> All NUMA nodes are by default top tier nodes. With lower memory
> tiers added we consider all memory tiers above a memory tier having
> CPU NUMA nodes as a top memory tier
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h | 11 +++++++++
>  include/linux/node.h         |  5 -----
>  mm/huge_memory.c             |  1 +
>  mm/memory-tiers.c            | 43 ++++++++++++++++++++++++++++++++++++
>  mm/migrate.c                 |  1 +
>  mm/mprotect.c                |  1 +
>  6 files changed, 57 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 0e58588fa066..085dd815bf73 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -20,6 +20,7 @@ extern bool numa_demotion_enabled;
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> +bool node_is_toptier(int node);
>  #else
>  static inline int next_demotion_node(int node)
>  {
> @@ -30,6 +31,11 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
>  {
>  	*targets = NODE_MASK_NONE;
>  }
> +
> +static inline bool node_is_toptier(int node)
> +{
> +	return true;
> +}
>  #endif
>  
>  #else
> @@ -44,5 +50,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
>  {
>  	*targets = NODE_MASK_NONE;
>  }
> +
> +static inline bool node_is_toptier(int node)
> +{
> +	return true;
> +}
>  #endif	/* CONFIG_NUMA */
>  #endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/include/linux/node.h b/include/linux/node.h
> index a2a16d4104fd..d0432db18094 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -191,9 +191,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
>  
>  #define to_node(device) container_of(device, struct node, dev)
>  
> -static inline bool node_is_toptier(int node)
> -{
> -	return node_state(node, N_CPU);
> -}
> -
>  #endif /* _LINUX_NODE_H_ */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 834f288b3769..8405662646e9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -35,6 +35,7 @@
>  #include <linux/numa.h>
>  #include <linux/page_owner.h>
>  #include <linux/sched/sysctl.h>
> +#include <linux/memory-tiers.h>
>  
>  #include <asm/tlb.h>
>  #include <asm/pgalloc.h>
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 4a96e4213d66..f0515bfd4051 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -13,6 +13,7 @@
>  
>  struct memory_tier {
>  	struct list_head list;
> +	int id;
>  	int perf_level;
>  	nodemask_t nodelist;
>  	nodemask_t lower_tier_mask;
> @@ -26,6 +27,7 @@ static LIST_HEAD(memory_tiers);
>  static DEFINE_MUTEX(memory_tier_lock);
>  
>  #ifdef CONFIG_MIGRATION
> +static int top_tier_id;
>  /*
>   * node_demotion[] examples:
>   *
> @@ -129,6 +131,7 @@ static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
>  	if (!new_memtier)
>  		return ERR_PTR(-ENOMEM);
>  
> +	new_memtier->id = perf_level >> MEMTIER_CHUNK_BITS;
>  	new_memtier->perf_level = perf_level;
>  	if (found_slot)
>  		list_add_tail(&new_memtier->list, ent);
> @@ -154,6 +157,31 @@ static struct memory_tier *__node_get_memory_tier(int node)
>  }
>  
>  #ifdef CONFIG_MIGRATION
> +bool node_is_toptier(int node)
> +{
> +	bool toptier;
> +	pg_data_t *pgdat;
> +	struct memory_tier *memtier;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return false;
> +
> +	rcu_read_lock();
> +	memtier = rcu_dereference(pgdat->memtier);
> +	if (!memtier) {
> +		toptier = true;
> +		goto out;
> +	}
> +	if (memtier->id >= top_tier_id)
> +		toptier = true;
> +	else
> +		toptier = false;
> +out:
> +	rcu_read_unlock();
> +	return toptier;
> +}
> +
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>  {
>  	struct memory_tier *memtier;
> @@ -304,6 +332,21 @@ static void establish_migration_targets(void)
>  			}
>  		} while (1);
>  	}
> +	/*
> +	 * Promotion is allowed from a memory tier to higher
> +	 * memory tier only if the memory tier doesn't include
> +	 * compute. We want to skip promotion from a memory tier
> +	 * if any node that is part of the memory tier has CPUs.
> +	 * Once we detect such a memory tier, we consider that tier
> +	 * the top tier, from which promotion is not allowed.
> +	 */
> +	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
> +		nodes_and(used, node_states[N_CPU], memtier->nodelist);
> +		if (!nodes_empty(used)) {
> +			top_tier_id = memtier->id;

I don't think we need to introduce a memory tier ID for this.  We can add
a top_tier_perf_level, set it here, and use it in node_is_toptier().
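
A userspace sketch of that alternative, comparing perf_level directly instead
of a derived tier ID, might look like the following.  The names struct
tier_model, find_top_tier_perf_level() and tier_is_toptier() are
hypothetical.  The sketch assumes that a higher perf_level means a higher
tier and that the threshold is the lowest tier containing a CPU node, which
is one reading of the patch comment rather than something the patch states.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a memory tier: only the fields this sketch needs. */
struct tier_model {
	int perf_level;  /* assumed non-negative; higher means faster memory */
	bool has_cpu;    /* true if any node in the tier has CPUs */
};

/* Returns the perf_level of the lowest tier that has CPUs, or -1 if none. */
static int find_top_tier_perf_level(const struct tier_model *tiers, size_t n)
{
	int top = -1;

	for (size_t i = 0; i < n; i++) {
		if (!tiers[i].has_cpu)
			continue;
		if (top == -1 || tiers[i].perf_level < top)
			top = tiers[i].perf_level;
	}
	return top;
}

/* A tier is "top" when it sits at or above the threshold. */
static bool tier_is_toptier(int perf_level, int top_tier_perf_level)
{
	/* No CPU tier known: treat every tier as top tier, as the default does. */
	if (top_tier_perf_level < 0)
		return true;
	return perf_level >= top_tier_perf_level;
}
```

Since the patch derives the ID from perf_level by a right shift, comparing
perf_level values directly preserves the same ordering without the extra
field.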

Best Regards,
Huang, Ying

> +			break;
> +		}
> +	}
>  	/*
>  	 * Now build the lower_tier mask for each node collecting node mask from
>  	 * all memory tier below it. This allows us to fallback demotion page
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c758c9c21d7d..1da81136eaaa 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -50,6 +50,7 @@
>  #include <linux/memory.h>
>  #include <linux/random.h>
>  #include <linux/sched/sysctl.h>
> +#include <linux/memory-tiers.h>
>  
>  #include <asm/tlbflush.h>
>  
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index ba5592655ee3..92a2fc0fa88b 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -31,6 +31,7 @@
>  #include <linux/pgtable.h>
>  #include <linux/sched/sysctl.h>
>  #include <linux/userfaultfd_k.h>
> +#include <linux/memory-tiers.h>
>  #include <asm/cacheflush.h>
>  #include <asm/mmu_context.h>
>  #include <asm/tlbflush.h>

* Re: [PATCH v10 8/8] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-07-25  8:54   ` Huang, Ying
@ 2022-07-25  8:56     ` Aneesh Kumar K V
  0 siblings, 0 replies; 39+ messages in thread
From: Aneesh Kumar K V @ 2022-07-25  8:56 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

On 7/25/22 2:24 PM, Huang, Ying wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> 
>> With memory tiers support we can have memory only NUMA nodes
>> in the top tier from which we want to avoid promotion tracking NUMA
>> faults. Update node_is_toptier to work with memory tiers.
>> All NUMA nodes are by default top tier nodes. With lower memory
>> tiers added we consider all memory tiers above a memory tier having
>> CPU NUMA nodes as a top memory tier
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  include/linux/memory-tiers.h | 11 +++++++++
>>  include/linux/node.h         |  5 -----
>>  mm/huge_memory.c             |  1 +
>>  mm/memory-tiers.c            | 43 ++++++++++++++++++++++++++++++++++++
>>  mm/migrate.c                 |  1 +
>>  mm/mprotect.c                |  1 +
>>  6 files changed, 57 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> index 0e58588fa066..085dd815bf73 100644
>> --- a/include/linux/memory-tiers.h
>> +++ b/include/linux/memory-tiers.h
>> @@ -20,6 +20,7 @@ extern bool numa_demotion_enabled;
>>  #ifdef CONFIG_MIGRATION
>>  int next_demotion_node(int node);
>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>> +bool node_is_toptier(int node);
>>  #else
>>  static inline int next_demotion_node(int node)
>>  {
>> @@ -30,6 +31,11 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
>>  {
>>  	*targets = NODE_MASK_NONE;
>>  }
>> +
>> +static inline bool node_is_toptier(int node)
>> +{
>> +	return true;
>> +}
>>  #endif
>>  
>>  #else
>> @@ -44,5 +50,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
>>  {
>>  	*targets = NODE_MASK_NONE;
>>  }
>> +
>> +static inline bool node_is_toptier(int node)
>> +{
>> +	return true;
>> +}
>>  #endif	/* CONFIG_NUMA */
>>  #endif  /* _LINUX_MEMORY_TIERS_H */
>> diff --git a/include/linux/node.h b/include/linux/node.h
>> index a2a16d4104fd..d0432db18094 100644
>> --- a/include/linux/node.h
>> +++ b/include/linux/node.h
>> @@ -191,9 +191,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
>>  
>>  #define to_node(device) container_of(device, struct node, dev)
>>  
>> -static inline bool node_is_toptier(int node)
>> -{
>> -	return node_state(node, N_CPU);
>> -}
>> -
>>  #endif /* _LINUX_NODE_H_ */
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 834f288b3769..8405662646e9 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -35,6 +35,7 @@
>>  #include <linux/numa.h>
>>  #include <linux/page_owner.h>
>>  #include <linux/sched/sysctl.h>
>> +#include <linux/memory-tiers.h>
>>  
>>  #include <asm/tlb.h>
>>  #include <asm/pgalloc.h>
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> index 4a96e4213d66..f0515bfd4051 100644
>> --- a/mm/memory-tiers.c
>> +++ b/mm/memory-tiers.c
>> @@ -13,6 +13,7 @@
>>  
>>  struct memory_tier {
>>  	struct list_head list;
>> +	int id;
>>  	int perf_level;
>>  	nodemask_t nodelist;
>>  	nodemask_t lower_tier_mask;
>> @@ -26,6 +27,7 @@ static LIST_HEAD(memory_tiers);
>>  static DEFINE_MUTEX(memory_tier_lock);
>>  
>>  #ifdef CONFIG_MIGRATION
>> +static int top_tier_id;
>>  /*
>>   * node_demotion[] examples:
>>   *
>> @@ -129,6 +131,7 @@ static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
>>  	if (!new_memtier)
>>  		return ERR_PTR(-ENOMEM);
>>  
>> +	new_memtier->id = perf_level >> MEMTIER_CHUNK_BITS;
>>  	new_memtier->perf_level = perf_level;
>>  	if (found_slot)
>>  		list_add_tail(&new_memtier->list, ent);
>> @@ -154,6 +157,31 @@ static struct memory_tier *__node_get_memory_tier(int node)
>>  }
>>  
>>  #ifdef CONFIG_MIGRATION
>> +bool node_is_toptier(int node)
>> +{
>> +	bool toptier;
>> +	pg_data_t *pgdat;
>> +	struct memory_tier *memtier;
>> +
>> +	pgdat = NODE_DATA(node);
>> +	if (!pgdat)
>> +		return false;
>> +
>> +	rcu_read_lock();
>> +	memtier = rcu_dereference(pgdat->memtier);
>> +	if (!memtier) {
>> +		toptier = true;
>> +		goto out;
>> +	}
>> +	if (memtier->id >= top_tier_id)
>> +		toptier = true;
>> +	else
>> +		toptier = false;
>> +out:
>> +	rcu_read_unlock();
>> +	return toptier;
>> +}
>> +
>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>  {
>>  	struct memory_tier *memtier;
>> @@ -304,6 +332,21 @@ static void establish_migration_targets(void)
>>  			}
>>  		} while (1);
>>  	}
>> +	/*
>> +	 * Promotion is allowed from a memory tier to higher
>> +	 * memory tier only if the memory tier doesn't include
>> +	 * compute. We want to  skip promotion from a memory tier,
>> +	 * if any node that is  part of the memory tier have CPUs.
>> +	 * Once we detect such a memory tier, we consider that tier
>> +	 * as top tiper from which promotion is not allowed.
>> +	 */
>> +	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
>> +		nodes_and(used, node_states[N_CPU], memtier->nodelist);
>> +		if (!nodes_empty(used)) {
>> +			top_tier_id = memtier->id;
> 
> I don't think we need to introduce memory tier ID for this.  We can add
> a top_tier_perf_level, set it here and use it in node_is_toptier().
> 

Sure. Will switch to that in the next iteration.

-aneesh
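
A minimal userspace sketch of the change agreed on in this exchange, for illustration only. It assumes the `top_tier_perf_level` name from the review and, as in this series, that a higher perf level means a faster tier. The arrays `node_tier_perf_level[]` and `node_has_cpu[]` and the helpers `reset_nodes()` and `update_top_tier()` are hypothetical stand-ins for the kernel's memory tier list and N_CPU node state; this is not the code of the next iteration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8	/* hypothetical small node count */
#define NO_TIER   (-1)	/* node is not assigned to any memory tier */

/* Hypothetical per-node state standing in for pgdat->memtier and N_CPU. */
static int  node_tier_perf_level[MAX_NODES];
static bool node_has_cpu[MAX_NODES];

/* Name suggested in the review; it replaces top_tier_id. */
static int top_tier_perf_level;

/* Start from a state where no node belongs to a tier or has CPUs. */
static void reset_nodes(void)
{
	int node;

	for (node = 0; node < MAX_NODES; node++) {
		node_tier_perf_level[node] = NO_TIER;
		node_has_cpu[node] = false;
	}
	top_tier_perf_level = 0;
}

/* Record the highest perf level of any tier that contains a CPU node. */
static void update_top_tier(void)
{
	int node;

	top_tier_perf_level = 0;
	for (node = 0; node < MAX_NODES; node++) {
		if (node_tier_perf_level[node] == NO_TIER || !node_has_cpu[node])
			continue;
		if (node_tier_perf_level[node] > top_tier_perf_level)
			top_tier_perf_level = node_tier_perf_level[node];
	}
}

/* A node without a tier counts as top tier, as in the quoted patch. */
static bool node_is_toptier(int node)
{
	assert(node >= 0 && node < MAX_NODES);
	if (node_tier_perf_level[node] == NO_TIER)
		return true;
	return node_tier_perf_level[node] >= top_tier_perf_level;
}
```

Comparing perf levels directly gives the same ordering as comparing tier IDs, because the ID in the quoted patch is just the perf level shifted right by MEMTIER_CHUNK_BITS.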


* Re: [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM
  2022-07-25  8:42         ` Aneesh Kumar K V
@ 2022-07-26  2:13           ` Huang, Ying
  2022-07-27  4:31             ` Aneesh Kumar K.V
  0 siblings, 1 reply; 39+ messages in thread
From: Huang, Ying @ 2022-07-26  2:13 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 7/25/22 2:05 PM, Huang, Ying wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> On 7/25/22 12:07 PM, Huang, Ying wrote:
>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>
>>>>> By default, all nodes are assigned to the default memory tier which
>>>>> is the memory tier designated for nodes with DRAM
>>>>>
>>>>> Set dax kmem device node's tier to slower memory tier by assigning
>>>>> performance level to MEMTIER_PERF_LEVEL_PMEM. PMEM tier
>>>>> appears below the default memory tier in demotion order.
>>>>>
>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>> ---
>>>>>  arch/powerpc/platforms/pseries/papr_scm.c | 41 ++++++++++++++++++++---
>>>>>  drivers/acpi/nfit/core.c                  | 41 ++++++++++++++++++++++-
>>>>>  2 files changed, 76 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
>>>>> index 82cae08976bc..3b6164418d6f 100644
>>>>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>>>>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>>>>> @@ -14,6 +14,8 @@
>>>>>  #include <linux/delay.h>
>>>>>  #include <linux/seq_buf.h>
>>>>>  #include <linux/nd.h>
>>>>> +#include <linux/memory.h>
>>>>> +#include <linux/memory-tiers.h>
>>>>>  
>>>>>  #include <asm/plpar_wrappers.h>
>>>>>  #include <asm/papr_pdsm.h>
>>>>> @@ -98,6 +100,7 @@ struct papr_scm_priv {
>>>>>  	bool hcall_flush_required;
>>>>>  
>>>>>  	uint64_t bound_addr;
>>>>> +	int target_node;
>>>>>  
>>>>>  	struct nvdimm_bus_descriptor bus_desc;
>>>>>  	struct nvdimm_bus *bus;
>>>>> @@ -1278,6 +1281,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>>>>  	p->bus_desc.module = THIS_MODULE;
>>>>>  	p->bus_desc.of_node = p->pdev->dev.of_node;
>>>>>  	p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
>>>>> +	p->target_node = dev_to_node(&p->pdev->dev);
>>>>>  
>>>>>  	/* Set the dimm command family mask to accept PDSMs */
>>>>>  	set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask);
>>>>> @@ -1322,7 +1326,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>>>>  	mapping.size = p->blocks * p->block_size; // XXX: potential overflow?
>>>>>  
>>>>>  	memset(&ndr_desc, 0, sizeof(ndr_desc));
>>>>> -	target_nid = dev_to_node(&p->pdev->dev);
>>>>> +	target_nid = p->target_node;
>>>>>  	online_nid = numa_map_to_online_node(target_nid);
>>>>>  	ndr_desc.numa_node = online_nid;
>>>>>  	ndr_desc.target_node = target_nid;
>>>>> @@ -1582,15 +1586,42 @@ static struct platform_driver papr_scm_driver = {
>>>>>  	},
>>>>>  };
>>>>>  
>>>>> +static int papr_scm_callback(struct notifier_block *self,
>>>>> +			     unsigned long action, void *arg)
>>>>> +{
>>>>> +	struct memory_notify *mnb = arg;
>>>>> +	int nid = mnb->status_change_nid;
>>>>> +	struct papr_scm_priv *p;
>>>>> +
>>>>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>>>>> +		return NOTIFY_OK;
>>>>> +
>>>>> +	mutex_lock(&papr_ndr_lock);
>>>>> +	list_for_each_entry(p, &papr_nd_regions, region_list) {
>>>>> +		if (p->target_node == nid) {
>>>>> +			node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>>>>> +			break;
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>> +	mutex_unlock(&papr_ndr_lock);
>>>>> +	return NOTIFY_OK;
>>>>> +}
>>>>> +
>>>>>  static int __init papr_scm_init(void)
>>>>>  {
>>>>>  	int ret;
>>>>>  
>>>>>  	ret = platform_driver_register(&papr_scm_driver);
>>>>> -	if (!ret)
>>>>> -		mce_register_notifier(&mce_ue_nb);
>>>>> -
>>>>> -	return ret;
>>>>> +	if (ret)
>>>>> +		return ret;
>>>>> +	mce_register_notifier(&mce_ue_nb);
>>>>> +	/*
>>>>> +	 * register a memory hotplug notifier at prio 2 so that we
>>>>> +	 * can update the perf level for the node.
>>>>> +	 */
>>>>> +	hotplug_memory_notifier(papr_scm_callback, MEMTIER_HOTPLUG_PRIO + 1);
>>>>> +	return 0;
>>>>>  }
>>>>>  module_init(papr_scm_init);
>>>>>  
>>>>> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
>>>>> index ae5f4acf2675..7ea1017ef790 100644
>>>>> --- a/drivers/acpi/nfit/core.c
>>>>> +++ b/drivers/acpi/nfit/core.c
>>>>> @@ -15,6 +15,8 @@
>>>>>  #include <linux/sort.h>
>>>>>  #include <linux/io.h>
>>>>>  #include <linux/nd.h>
>>>>> +#include <linux/memory.h>
>>>>> +#include <linux/memory-tiers.h>
>>>>>  #include <asm/cacheflush.h>
>>>>>  #include <acpi/nfit.h>
>>>>>  #include "intel.h"
>>>>> @@ -3470,6 +3472,39 @@ static struct acpi_driver acpi_nfit_driver = {
>>>>>  	},
>>>>>  };
>>>>>  
>>>>> +static int nfit_callback(struct notifier_block *self,
>>>>> +			 unsigned long action, void *arg)
>>>>> +{
>>>>> +	bool found = false;
>>>>> +	struct memory_notify *mnb = arg;
>>>>> +	int nid = mnb->status_change_nid;
>>>>> +	struct nfit_spa *nfit_spa;
>>>>> +	struct acpi_nfit_desc *acpi_desc;
>>>>> +
>>>>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>>>>> +		return NOTIFY_OK;
>>>>> +
>>>>> +	mutex_lock(&acpi_desc_lock);
>>>>> +	list_for_each_entry(acpi_desc, &acpi_descs, list) {
>>>>> +		mutex_lock(&acpi_desc->init_mutex);
>>>>> +		list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
>>>>> +			struct acpi_nfit_system_address *spa = nfit_spa->spa;
>>>>> +			int target_node = pxm_to_node(spa->proximity_domain);
>>>>> +
>>>>> +			if (target_node == nid) {
>>>>> +				node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>>>>> +				found = true;
>>>>> +				break;
>>>>> +			}
>>>>> +		}
>>>>> +		mutex_unlock(&acpi_desc->init_mutex);
>>>>> +		if (found)
>>>>> +			break;
>>>>> +	}
>>>>> +	mutex_unlock(&acpi_desc_lock);
>>>>> +	return NOTIFY_OK;
>>>>> +}
>>>>> +
>>>>>  static __init int nfit_init(void)
>>>>>  {
>>>>>  	int ret;
>>>>> @@ -3509,7 +3544,11 @@ static __init int nfit_init(void)
>>>>>  		nfit_mce_unregister();
>>>>>  		destroy_workqueue(nfit_wq);
>>>>>  	}
>>>>> -
>>>>> +	/*
>>>>> +	 * register a memory hotplug notifier at prio 2 so that we
>>>>> +	 * can update the perf level for the node.
>>>>> +	 */
>>>>> +	hotplug_memory_notifier(nfit_callback, MEMTIER_HOTPLUG_PRIO + 1);
>>>>>  	return ret;
>>>>>  
>>>>>  }
>>>>
>>>> I don't think that it's a good idea to set perf_level of a memory device
>>>> (node) via NFIT only.
>>>
>>>>
>>>> For example, we may prefer HMAT over NFIT when it's available.  So the
>>>> perf_level should be set in dax/kmem.c based on information provided by
>>>> ACPI or other information sources.  ACPI can provide some functions/data
>>>> structures to let drivers (like dax/kmem.c) query the properties of
>>>> the memory device (node).
>>>>
>>>
>>> I was trying to make it architecture specific so that we have a placeholder
>>> to fine-tune this better. For example, ppc64 will look at device tree
>>> details to find the performance level and x86 will look at ACPI data structure.
>>> Adding that hotplug callback in dax/kmem will prevent that architecture-specific
>>> customization? 
>>>
>>> I would expect that callback to move to the generic ACPI layer so that even
>>> firmware managed CXL devices can be added to a lower tier?  I don't understand
>>> ACPI enough to find the right abstraction for that hotplug callback. 
>> 
>> I'm OK for this to be architecture specific.
>> 
>> But ACPI NFIT isn't enough for x86.  For example, PMEM can be added to a
>> virtual machine as normal memory nodes without NFIT.  Instead, PMEM is
>> marked via "memmap=<nn>G!<ss>G" or "efi_fake_mem=<nn>G@<ss>G:0x40000",
>> and dax/kmem.c is used to hot-add the memory.
>> 
>> So, until a more sophisticated version is implemented for x86, the
>> simplest version, as I suggest below, works even better.
>> 
>>>> As the simplest first version, this can be just hard coded.
>>>>
>>>
>>> If you are suggesting not to use the hotplug callback, one of the challenges is that node_devices[nid]
>>> gets allocated pretty late, when we try to online the node.
>> 
>> As the simplest first version, this can be as simple as,
>> 
>> /* dax/kmem.c */
>> static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>> {
>> 	node_devices[dev_dax->target_node]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>> 	/* add_memory_driver_managed() */
>> }
>> 
>> To be compatible with the ppc64 version, how about making dev_dax_kmem_probe()
>> set perf_level only if it's uninitialized?
>
> That will result in a kernel crash because node_devices[dev_dax->target_node] is not initialized there.
>
> It gets allocated in add_memory_resource -> __try_online_node ->
> register_one_node -> __register_one_node -> node_devices[nid] =
> kzalloc(sizeof(struct node), GFP_KERNEL);

Ah, right!  So we need some other way to do that, for example, a global
array as follows,

  int node_perf_levels[MAX_NUMNODES];
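
A rough userspace sketch of how such a global array could decouple the driver from the late allocation of node_devices[nid]: the driver records a level at probe time and the online path reads it later. `MAX_NUMNODES` is given a small illustrative value here, and `set_node_perf_level()` and `node_perf_level_or_default()` are hypothetical helpers, not existing kernel functions. Zero means "uninitialized", as in the patch, and the setter writes only when the entry is unset, mirroring the "only if it's uninitialized" suggestion quoted above.

```c
#include <assert.h>

#define MAX_NUMNODES		8	/* illustrative; the kernel value is config dependent */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_PERF_LEVEL_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
#define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)

/* Statically sized, so it is usable before node_devices[nid] is allocated. */
static int node_perf_levels[MAX_NUMNODES];

/* Hypothetical driver-side helper: record a level unless one is already set. */
static void set_node_perf_level(int nid, int perf_level)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	if (!node_perf_levels[nid])
		node_perf_levels[nid] = perf_level;
}

/* Hypothetical online-path helper: zero means unset, so use the DRAM default. */
static int node_perf_level_or_default(int nid)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	return node_perf_levels[nid] ? node_perf_levels[nid] : MEMTIER_PERF_LEVEL_DRAM;
}
```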

And, I think that we need to consider the memory type here too.  As
suggested by Johannes, a memory type describes a set of memory devices
(NUMA nodes) with the same performance characteristics (that is, abstract
distance or perf level).  The data structure can be something as follows,

  struct memory_type {
        int perf_level;
        struct list_head tier_sibling;
        nodemask_t nodes;
  };

The memory_type should be created and managed by the device drivers
(e.g., dax/kmem) which manage the memory devices.  In the future, we
will represent it in sysfs, and a per-memory_type knob will be provided
to offset the perf_level of all memory devices managed by the
memory_type.

With memory_type, the memory_tier becomes,

  struct memory_tier {
        int perf_level_start;
        struct list_head sibling;
        struct list_head memory_types;
  };

And we need an array to map from nid to memory_type, e.g., as follows,

  struct memory_type *node_memory_types[MAX_NUMNODES];

We need to manage the memory_type in device drivers, instead of ACPI or
device tree callbacks.

Because memory_type is an important part of the explicit memory tier
implementation and may influence the design, I suggest including it in
the implementation now.  It doesn't appear too complex anyway.
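
As a hedged illustration of the proposal, the following userspace sketch models the nid-to-memory_type mapping. The list linkage from the structures above is omitted, nodemask_t is modelled as a plain bitmask, `MAX_NUMNODES` is an illustrative value, and `node_set_memory_type()` and `node_perf_level()` are hypothetical helper names rather than part of any posted patch.

```c
#include <assert.h>

#define MAX_NUMNODES 8	/* illustrative */

/* Simplified memory_type: one perf level shared by a set of nodes. */
struct memory_type {
	int perf_level;
	unsigned long nodes;	/* stand-in for nodemask_t */
};

/* Map from nid to memory_type, as proposed in the mail. */
static struct memory_type *node_memory_types[MAX_NUMNODES];

/* Hypothetical helper a driver such as dax/kmem might call before onlining. */
static void node_set_memory_type(int nid, struct memory_type *type)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	/* Drop the node from its previous type, if it had one. */
	if (node_memory_types[nid])
		node_memory_types[nid]->nodes &= ~(1UL << nid);
	node_memory_types[nid] = type;
	if (type)
		type->nodes |= 1UL << nid;
}

/* Hypothetical lookup: returns fallback when no driver has claimed the node. */
static int node_perf_level(int nid, int fallback)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	return node_memory_types[nid] ? node_memory_types[nid]->perf_level : fallback;
}
```

Because every node reads its level through the shared memory_type, adjusting `perf_level` on the type changes it for all nodes of that type at once, which is what a per-memory_type offset knob would rely on.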

Best Regards,
Huang, Ying


* Re: [PATCH v10 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-20  2:59 ` [PATCH v10 1/8] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
@ 2022-07-26  3:53   ` Huang, Ying
  2022-07-26 11:59     ` Aneesh Kumar K V
  0 siblings, 1 reply; 39+ messages in thread
From: Huang, Ying @ 2022-07-26  3:53 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created
> during the kernel initialization and updated when a NUMA node is
> hot-added or hot-removed.  The current implementation puts all
> nodes with CPU into the top tier,
                          ~~~~~~~~

Because we will change the semantics of "top tier" later in the
patchset, I suggest replacing "top tier" with "highest tier" in the
patch description to avoid potential confusion.

> and builds the tier hierarchy
> tier-by-tier by establishing the per-node demotion targets based
> on the distances between nodes.
>
> This current memory tier kernel interface needs to be improved for
                                  ~~~~~~~~~

s/interface/implementation/

> several important use cases,
>
> The current tier initialization code always initializes
> each memory-only NUMA node into a lower tier.  But a memory-only
> NUMA node may have a high performance memory device (e.g. a DRAM
> device attached via CXL.mem

Per my understanding, CXL attached DRAM may be put in a lower tier in
some use cases, such as the one in the following paper from Meta,

https://arxiv.org/pdf/2206.02878.pdf

So I suggest removing this use case here.

> or a DRAM-backed memory-only node on
> a virtual machine) and should be put into a higher tier.
>
> The current tier hierarchy always puts CPU nodes into the top
> tier. But on a system with HBM or GPU devices, the
> memory-only NUMA nodes mapping these devices should be in the
> top tier, and DRAM nodes with CPUs are better to be placed into the
> next lower tier.
>
> With current kernel higher tier node can only be demoted to selected nodes on the
                                                              ~~~~~~~~~~~~~~

nodes with shortest distance

> next lower tier as defined by the demotion path, not any other
> node from any lower tier.  This strict, hard-coded demotion order
                                          ~~~~~~~~~~

The demotion path is generated automatically instead of hard-coded.  So
I suggest removing "hard-coded"; "too strict" should be enough to
describe the current issue.

> does not work in all use cases (e.g. some use cases may want to
> allow cross-socket demotion to another node in the same demotion
> tier as a fallback when the preferred demotion node is out of
> space), This demotion order is also inconsistent with the page
> allocation fallback order when all the nodes in a higher tier are
> out of space: The page allocation can fall back to any node from
> any lower tier, whereas the demotion order doesn't allow that.

I can understand the description above.  But I think an example may be
easier to understand.

> The current kernel also don't provide any interfaces for the
> userspace to learn about the memory tier hierarchy in order to
> optimize its memory allocations.

We don't provide a user space interface in this patchset, so should we
remove this paragraph?

In addition to the above, I think explicit memory tiers make it easier
to manage memory tiers and build additional mechanisms on top of them,
e.g., demotion, promotion, partitioning, interleaving, etc.

> This patch series address the above by defining memory tiers explicitly.
>
> Linux kernel presents memory devices as NUMA nodes and each memory device is of
> a specific type. The memory type of a device is represented by its performance
> level. A memory tier corresponds to a range of performance levels. This allows
> for classifying memory devices with a specific performance range into a memory
> tier.
>
> This patch configures the range/chunk size to be 128. The default DRAM
> performance level is 512. We can have 4 memory tiers below the default DRAM
> performance level which cover the range 0 - 127, 128 - 255, 256 - 383, 384 - 511.
> Slower memory devices like persistent memory will have performance levels below
> the default DRAM level and hence will be placed in these 4 lower tiers.
>
> While reclaim we migrate pages from fast(higher) tiers to slow(lower) tiers when
> the fast(higher) tier is under memory pressure.

This doesn't appear to be related to the patch?

> A kernel parameter is provided to override the default memory tier.
>
> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h |  18 +++++++
>  include/linux/node.h         |   6 +++
>  mm/Makefile                  |   1 +
>  mm/memory-tiers.c            | 101 +++++++++++++++++++++++++++++++++++
>  4 files changed, 126 insertions(+)
>  create mode 100644 include/linux/memory-tiers.h
>  create mode 100644 mm/memory-tiers.c
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> new file mode 100644
> index 000000000000..f28f9910a4e7
> --- /dev/null
> +++ b/include/linux/memory-tiers.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +#ifdef CONFIG_NUMA
> +/*
> + * Each tier cover a performance level chunk size of 128
> + */
> +#define MEMTIER_CHUNK_BITS	7

Instead of playing with BITS always, it may be easier to use _SIZE too.

#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
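
A small worked sketch of the range arithmetic, using the macros from the patch plus the suggested MEMTIER_CHUNK_SIZE. The helpers `memtier_start()` and `memtier_end()` are hypothetical names introduced here for illustration. With a chunk size of 128, the default DRAM level 512 falls in the tier covering 512..639, and the PMEM level 128 falls in the tier covering 128..255.

```c
#include <assert.h>

#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_PERF_LEVEL_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
#define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)

/* Hypothetical helper: first perf level of the tier containing perf_level. */
static unsigned int memtier_start(unsigned int perf_level)
{
	/* The mask trick below is only valid for a power-of-two chunk size. */
	assert((MEMTIER_CHUNK_SIZE & (MEMTIER_CHUNK_SIZE - 1)) == 0);
	return perf_level & ~(unsigned int)(MEMTIER_CHUNK_SIZE - 1);
}

/* Hypothetical helper: last perf level of that tier (inclusive). */
static unsigned int memtier_end(unsigned int perf_level)
{
	return memtier_start(perf_level) + MEMTIER_CHUNK_SIZE - 1;
}
```

The mask in `memtier_start()` does the same job as the kernel's round_down() for a power-of-two chunk size.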

> +/*
> + * For now let's have 4 memory tier below default DRAM tier.
> + */
> +#define MEMTIER_PERF_LEVEL_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
> +/* leave one tier below this slow pmem */
> +#define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)
> +
> +#endif	/* CONFIG_NUMA */
> +#endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 40d641a8bfb0..a2a16d4104fd 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -92,6 +92,12 @@ struct node {
>  	struct list_head cache_attrs;
>  	struct device *cache_dev;
>  #endif
> +	/*
> +	 * For memory devices, perf_level describes
> +	 * the device performance and how it should be used
> +	 * while building a memory hierarchy.
> +	 */
> +	int perf_level;

Thinking again, I find that "perf_level" may not be the best abstraction
of the performance of memory devices.  In concept, it's an abstraction of the memory
bandwidth.  But it will not reflect the memory latency.

Instead, the previously proposed "abstract_distance" is an abstraction of
the memory latency.  Per my understanding, the memory latency has a more
direct influence on system performance.  And because the latency of the
memory device will increase if the memory accessing throughput nears its
max bandwidth, the memory bandwidth can be reflected in the "abstract
distance" too.  That is, the "abstract distance" is an abstraction of
the memory latency under the expected memory accessing throughput.  The
"offset" to the default "abstract distance" reflects the different
expected memory accessing throughput.

So, I think we need some kind of abstraction of the memory latency
instead of memory bandwidth, e.g., "abstract distance".

>  };
>  
>  struct memory_block;
> diff --git a/mm/Makefile b/mm/Makefile
> index 6f9ffa968a1a..d30acebc2164 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>  obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> new file mode 100644
> index 000000000000..61bb84c54091
> --- /dev/null
> +++ b/mm/memory-tiers.c
> @@ -0,0 +1,101 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/types.h>
> +#include <linux/nodemask.h>
> +#include <linux/slab.h>
> +#include <linux/lockdep.h>
> +#include <linux/moduleparam.h>
> +#include <linux/node.h>
> +#include <linux/memory-tiers.h>
> +
> +struct memory_tier {
> +	struct list_head list;
> +	int perf_level;

Because memory_tier corresponds to a range of perf_levels, I think the
better name is "perf_level_start".  And, we can add some comments about
the range, that is, perf_level_start .. perf_level_start + MEMTIER_CHUNK_SIZE.

> +	nodemask_t nodelist;

I don't think nodelist is a good name here.  It's a bitmask instead of
a linked list.  I suggest using "nodemask" or "nodes".

> +};
> +
> +static LIST_HEAD(memory_tiers);
> +static DEFINE_MUTEX(memory_tier_lock);
> +
> +/*
> + * For now let's have 4 memory tier below default DRAM tier.
> + */
> +static unsigned int default_memtier_perf_level = MEMTIER_PERF_LEVEL_DRAM;
> +core_param(default_memory_tier_perf_level, default_memtier_perf_level, uint, 0644);

Do you have any existing use case of the kernel parameter?  If not, we
can add some kind of knob when the use cases are clear.

We don't even need default_memtier_perf_level for now, why not just use
MEMTIER_PERF_LEVEL_DRAM?

> +/*
> + * performance levels are grouped into memtiers each of chunk size
> + * memtier_perf_chunk
> + */
> +static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
> +{
> +	bool found_slot = false;
> +	struct list_head *ent;
> +	struct memory_tier *memtier, *new_memtier;
> +	unsigned int memtier_perf_chunk_size = 1 << MEMTIER_CHUNK_BITS;
> +	/*
> +	 * zero is special in that it indicates uninitialized
> +	 * perf level by respective driver. Pick default memory
> +	 * tier perf level for that.
> +	 */
> +	if (!perf_level)
> +		perf_level = default_memtier_perf_level;
> +
> +	lockdep_assert_held_once(&memory_tier_lock);
> +
> +	perf_level = round_down(perf_level, memtier_perf_chunk_size);
> +	list_for_each(ent, &memory_tiers) {

We can use list_for_each_entry(memtier, &memory_tiers, list) here.  When we
need to insert into list below, we can use

  list_add_tail(&new_memtier->list, &memtier->list);
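
To illustrate why the found_slot flag becomes unnecessary, here is a userspace sketch with minimal stand-ins for the kernel list helpers. It iterates with a raw `struct list_head` cursor rather than list_for_each_entry(), and uses calloc() in place of kzalloc(); both are simplifications made here. The key point is the same: when the loop runs to completion the cursor is the list head, and inserting before the head appends at the tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal userspace stand-ins for the kernel's circular list helpers. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Insert new_entry immediately before pos, like the kernel's list_add_tail(). */
static void list_add_tail(struct list_head *new_entry, struct list_head *pos)
{
	assert(new_entry && pos);
	new_entry->prev = pos->prev;
	new_entry->next = pos;
	pos->prev->next = new_entry;
	pos->prev = new_entry;
}

struct memory_tier {
	struct list_head list;
	int perf_level;
};

static struct list_head memory_tiers = { &memory_tiers, &memory_tiers };

/* Find the tier for perf_level or insert a new one, keeping ascending order. */
static struct memory_tier *find_create_memory_tier(int perf_level)
{
	struct list_head *pos;
	struct memory_tier *memtier, *new_memtier;

	for (pos = memory_tiers.next; pos != &memory_tiers; pos = pos->next) {
		memtier = container_of(pos, struct memory_tier, list);
		if (perf_level == memtier->perf_level)
			return memtier;
		if (perf_level < memtier->perf_level)
			break;
	}

	new_memtier = calloc(1, sizeof(*new_memtier));
	if (!new_memtier)
		return NULL;
	new_memtier->perf_level = perf_level;
	/*
	 * pos is either the first entry with a larger perf level or, if
	 * the loop ran to completion, the list head itself.  Inserting
	 * before the head appends at the tail, so one call covers both.
	 */
	list_add_tail(&new_memtier->list, pos);
	return new_memtier;
}
```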

> +		memtier = list_entry(ent, struct memory_tier, list);
> +		if (perf_level == memtier->perf_level) {
> +			return memtier;
> +		} else if (perf_level < memtier->perf_level) {
> +			found_slot = true;
> +			break;
> +		}
> +	}
> +
> +	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> +	if (!new_memtier)
> +		return ERR_PTR(-ENOMEM);
> +
> +	new_memtier->perf_level = perf_level;
> +	if (found_slot)
> +		list_add_tail(&new_memtier->list, ent);
> +	else
> +		list_add_tail(&new_memtier->list, &memory_tiers);
> +	return new_memtier;
> +}
> +
> +static int __init memory_tier_init(void)
> +{
> +	int node;
> +	struct memory_tier *memtier;
> +
> +	/*
> +	 * Since this is early during  boot, we could avoid
> +	 * holding memtory_tier_lock. But keep it simple by
> +	 * holding locks. So we can add lock held debug checks
> +	 * in other functions.
> +	 */

I don't think the comment above is necessary.  Acquiring an uncontended
lock in a slow path isn't a big deal.

> +	mutex_lock(&memory_tier_lock);
> +	memtier = find_create_memory_tier(default_memtier_perf_level);
> +	if (IS_ERR(memtier))
> +		panic("%s() failed to register memory tier: %ld\n",
> +		      __func__, PTR_ERR(memtier));
> +
> +	/* CPU only nodes are not part of memory tiers. */
> +	memtier->nodelist = node_states[N_MEMORY];
> +
> +	/*
> +	 * nodes that are already online and that doesn't
> +	 * have perf level assigned is assigned a default perf
> +	 * level.
> +	 */
> +	for_each_node_state(node, N_MEMORY) {
> +		struct node *node_property = node_devices[node];
> +
> +		if (!node_property->perf_level)
> +			node_property->perf_level = default_memtier_perf_level;
> +	}
> +	mutex_unlock(&memory_tier_lock);

Another implementation is to call the "online" function for each node with
memory.  In this way, the same code path can be used for statically and
dynamically onlined nodes.

> +	return 0;
> +}
> +subsys_initcall(memory_tier_init);

Best Regards,
Huang, Ying


* Re: [PATCH v10 3/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined
  2022-07-20  2:59 ` [PATCH v10 3/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined Aneesh Kumar K.V
@ 2022-07-26  4:03   ` Huang, Ying
  2022-07-26 12:03     ` Aneesh Kumar K V
  0 siblings, 1 reply; 39+ messages in thread
From: Huang, Ying @ 2022-07-26  4:03 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> If the new NUMA node onlined doesn't have a performance level assigned,
> the kernel adds the NUMA node to default memory tier.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h |  1 +
>  mm/memory-tiers.c            | 75 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 76 insertions(+)
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index ef380a39db3a..3d5f14d57ae6 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -14,6 +14,7 @@
>  #define MEMTIER_PERF_LEVEL_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
>  /* leave one tier below this slow pmem */
>  #define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)
> +#define MEMTIER_HOTPLUG_PRIO	100
>  
>  extern bool numa_demotion_enabled;
>  
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 41a21cc5ae55..cc3a47ec18e4 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -5,6 +5,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/moduleparam.h>
>  #include <linux/node.h>
> +#include <linux/memory.h>
>  #include <linux/memory-tiers.h>
>  
>  struct memory_tier {
> @@ -64,6 +65,78 @@ static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
>  	return new_memtier;
>  }
>  
> +static struct memory_tier *__node_get_memory_tier(int node)
> +{
> +	struct memory_tier *memtier;
> +
> +	list_for_each_entry(memtier, &memory_tiers, list) {
> +		if (node_isset(node, memtier->nodelist))
> +			return memtier;
> +	}
> +	return NULL;
> +}
> +
> +static void init_node_memory_tier(int node)

set_node_memory_tier()?

> +{
> +	int perf_level;
> +	struct memory_tier *memtier;
> +
> +	mutex_lock(&memory_tier_lock);
> +
> +	memtier = __node_get_memory_tier(node);
> +	if (!memtier) {
> +		perf_level = node_devices[node]->perf_level;
> +		memtier = find_create_memory_tier(perf_level);
> +		node_set(node, memtier->nodelist);
> +	}
> +	mutex_unlock(&memory_tier_lock);
> +}
> +
> +static void clear_node_memory_tier(int node)
> +{
> +	struct memory_tier *memtier;
> +
> +	mutex_lock(&memory_tier_lock);
> +	memtier = __node_get_memory_tier(node);
> +	if (memtier)
> +		node_clear(node, memtier->nodelist);

When memtier->nodelist becomes empty, do we need to free memtier?
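
One possible shape of that cleanup, sketched in userspace under simplifying assumptions: the tier list is a singly linked list, nodemask_t is a plain bitmask, and `add_memory_tier()` is a hypothetical helper added only so the sketch is self-contained. In the kernel, a tier that RCU readers can still reach would presumably need a deferred free rather than the immediate free() shown here.

```c
#include <assert.h>
#include <stdlib.h>

struct memory_tier {
	struct memory_tier *next;	/* simplified singly linked list */
	int perf_level;
	unsigned long nodes;		/* nodemask_t modelled as a bitmask */
};

static struct memory_tier *memory_tiers;

/* Hypothetical helper: create an empty tier and link it at the front. */
static struct memory_tier *add_memory_tier(int perf_level)
{
	struct memory_tier *memtier = calloc(1, sizeof(*memtier));

	if (!memtier)
		return NULL;
	memtier->perf_level = perf_level;
	memtier->next = memory_tiers;
	memory_tiers = memtier;
	return memtier;
}

/* Remove node from its tier; unlink and free the tier once it has no nodes. */
static void clear_node_memory_tier(int node)
{
	struct memory_tier **link, *memtier;

	assert(node >= 0 && node < (int)(8 * sizeof(unsigned long)));
	for (link = &memory_tiers; (memtier = *link) != NULL; link = &memtier->next) {
		if (!(memtier->nodes & (1UL << node)))
			continue;
		memtier->nodes &= ~(1UL << node);
		if (!memtier->nodes) {
			*link = memtier->next;
			free(memtier);
		}
		return;
	}
}
```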

> +	mutex_unlock(&memory_tier_lock);
> +}
> +
> +/*
> + * This runs whether reclaim-based migration is enabled or not,
> + * which ensures that the user can turn reclaim-based migration
> + * at any time without needing to recalculate migration targets.
> + */

The comment doesn't apply here.

> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> +						 unsigned long action, void *_arg)

Now we are building memory tiers instead of working on demotion.  So I
think we should rename the function to memtier_hotplug_callback().

> +{
> +	struct memory_notify *arg = _arg;
> +
> +	/*
> +	 * Only update the node migration order when a node is
> +	 * changing status, like online->offline.
> +	 */
> +	if (arg->status_change_nid < 0)
> +		return notifier_from_errno(0);
> +
> +	switch (action) {
> +	case MEM_OFFLINE:
> +		clear_node_memory_tier(arg->status_change_nid);
> +		break;
> +	case MEM_ONLINE:
> +		init_node_memory_tier(arg->status_change_nid);
> +		break;
> +	}
> +
> +	return notifier_from_errno(0);
> +}
> +
> +static void __init migrate_on_reclaim_init(void)
> +{
> +	hotplug_memory_notifier(migrate_on_reclaim_callback, MEMTIER_HOTPLUG_PRIO);
> +}

I suggest calling hotplug_memory_notifier() directly in
memory_tier_init().  We are not working on demotion here.

> +
>  static int __init memory_tier_init(void)
>  {
>  	int node;
> @@ -96,6 +169,8 @@ static int __init memory_tier_init(void)
>  			node_property->perf_level = default_memtier_perf_level;
>  	}
>  	mutex_unlock(&memory_tier_lock);
> +
> +	migrate_on_reclaim_init();
>  	return 0;
>  }
>  subsys_initcall(memory_tier_init);

Best Regards,
Huang, Ying

* Re: [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-20  2:59 ` [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
  2022-07-20  3:38   ` Aneesh Kumar K.V
  2022-07-21  0:02   ` kernel test robot
@ 2022-07-26  7:44   ` Huang, Ying
  2022-07-26 12:30     ` Aneesh Kumar K V
  2 siblings, 1 reply; 39+ messages in thread
From: Huang, Ying @ 2022-07-26  7:44 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> This patch switches the demotion target building logic to use memory tiers
> instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
> default memory tier and additional memory tiers will be added by drivers like
> dax kmem.
>
> This patch builds the demotion target for a NUMA node by looking at all
> memory tiers below the tier to which the NUMA node belongs. The closest node
> in the immediately following memory tier is used as a demotion target.
>
> Since we are now only building demotion target for N_MEMORY NUMA nodes
> the CPU hotplug calls are removed in this patch.
>
> A new memory tier can be inserted into the tier hierarchy for a new set of nodes
> without affecting the node assignment of any existing memtier, provided that
> there is enough gap in the performance level values for the new memtier.
>
> The absolute value of performance level of a memtier doesn't necessarily carry
> any meaning. Its value relative to other memtiers decides the level of this
> memtier in the tier hierarchy.

The above two paragraphs do not appear to be related to this patch.

> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h |  12 ++
>  include/linux/migrate.h      |  13 --
>  mm/memory-tiers.c            | 218 ++++++++++++++++++-
>  mm/migrate.c                 | 394 -----------------------------------
>  mm/vmstat.c                  |   4 -
>  5 files changed, 229 insertions(+), 412 deletions(-)
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 3d5f14d57ae6..852e86bd0a23 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -17,9 +17,21 @@
>  #define MEMTIER_HOTPLUG_PRIO	100
>  
>  extern bool numa_demotion_enabled;
> +#ifdef CONFIG_MIGRATION
> +int next_demotion_node(int node);
> +#else
> +static inline int next_demotion_node(int node)
> +{
> +	return NUMA_NO_NODE;
> +}
> +#endif
>  
>  #else
>  
>  #define numa_demotion_enabled	false
> +static inline int next_demotion_node(int node)
> +{
> +	return NUMA_NO_NODE;
> +}
>  #endif	/* CONFIG_NUMA */
>  #endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 43e737215f33..93fab62e6548 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
>  
>  #endif /* CONFIG_MIGRATION */
>  
> -#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
> -extern void set_migration_target_nodes(void);
> -extern void migrate_on_reclaim_init(void);
> -extern int next_demotion_node(int node);
> -#else
> -static inline void set_migration_target_nodes(void) {}
> -static inline void migrate_on_reclaim_init(void) {}
> -static inline int next_demotion_node(int node)
> -{
> -        return NUMA_NO_NODE;
> -}
> -#endif
> -
>  #ifdef CONFIG_COMPACTION
>  extern int PageMovable(struct page *page);
>  extern void __SetPageMovable(struct page *page, struct address_space *mapping);
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index cc3a47ec18e4..a8cfe2ca3903 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -6,17 +6,88 @@
>  #include <linux/moduleparam.h>
>  #include <linux/node.h>
>  #include <linux/memory.h>
> +#include <linux/random.h>
>  #include <linux/memory-tiers.h>
>  
> +#include "internal.h"
> +
>  struct memory_tier {
>  	struct list_head list;
>  	int perf_level;
>  	nodemask_t nodelist;
>  };
>  
> +struct demotion_nodes {
> +	nodemask_t preferred;
> +};
> +
>  static LIST_HEAD(memory_tiers);
>  static DEFINE_MUTEX(memory_tier_lock);
>  
> +#ifdef CONFIG_MIGRATION
> +/*
> + * node_demotion[] examples:
> + *
> + * Example 1:
> + *
> + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
> + *
> + * node distances:
> + * node   0    1    2    3
> + *    0  10   20   30   40
> + *    1  20   10   40   30
> + *    2  30   40   10   40
> + *    3  40   30   40   10
> + *
> + * memory_tiers[0] = <empty>
> + * memory_tiers[1] = 0-1
> + * memory_tiers[2] = 2-3

We don't have a memory_tiers array now, and we don't use static memory
tier IDs either.  So I suggest changing the above to:

memory_tier0: 0-1
memory_tier1: 2-3
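
With that layout, the preferred targets listed for Example 1 follow
from the distance table.  The userspace sketch below (not kernel code)
recomputes them; preferred_targets(), tier_nodes[] and the 32-bit masks
are hypothetical stand-ins, and the nearest-node search is a
simplification of what find_next_best_node() does in the patch.

```c
#include <stdint.h>

#define NR_NODES 4
#define NR_TIERS 2

/* Distance table from Example 1. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 40, 30 },
	{ 30, 40, 10, 40 },
	{ 40, 30, 40, 10 },
};

/* Tiers ordered from highest to lowest; each entry is a node bitmask. */
static const uint32_t tier_nodes[NR_TIERS] = {
	0x3,	/* memory_tier0: nodes 0-1 */
	0xc,	/* memory_tier1: nodes 2-3 */
};

/* Return the mask of nearest nodes in the tier right below @node's tier. */
static uint32_t preferred_targets(int node)
{
	int tier, n, best = -1;
	uint32_t next, preferred = 0;

	for (tier = 0; tier < NR_TIERS; tier++)
		if (tier_nodes[tier] & (1u << node))
			break;
	if (tier >= NR_TIERS - 1)
		return 0;	/* not in any tier, or already in the lowest one */

	next = tier_nodes[tier + 1];
	for (n = 0; n < NR_NODES; n++) {
		if (!(next & (1u << n)))
			continue;
		if (best == -1 || distance[node][n] < best) {
			/* Strictly closer node: restart the preferred set. */
			best = distance[node][n];
			preferred = 1u << n;
		} else if (distance[node][n] == best) {
			/* Same best distance: add it to the preferred set. */
			preferred |= 1u << n;
		}
	}
	return preferred;
}
```

Node 0 ends up with node 2 as its only target and node 1 with node 3,
while nodes 2 and 3 have none because they sit in the lowest tier.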

> + *
> + * node_demotion[0].preferred = 2
> + * node_demotion[1].preferred = 3
> + * node_demotion[2].preferred = <empty>
> + * node_demotion[3].preferred = <empty>
> + *
> + * Example 2:
> + *
> + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
> + *
> + * node distances:
> + * node   0    1    2
> + *    0  10   20   30
> + *    1  20   10   30
> + *    2  30   30   10
> + *
> + * memory_tiers[0] = <empty>
> + * memory_tiers[1] = 0-2
> + * memory_tiers[2] = <empty>
> + *
> + * node_demotion[0].preferred = <empty>
> + * node_demotion[1].preferred = <empty>
> + * node_demotion[2].preferred = <empty>
> + *
> + * Example 3:
> + *
> + * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
> + *
> + * node distances:
> + * node   0    1    2
> + *    0  10   20   30
> + *    1  20   10   40
> + *    2  30   40   10
> + *
> + * memory_tiers[0] = 1
> + * memory_tiers[1] = 0
> + * memory_tiers[2] = 2
> + *
> + * node_demotion[0].preferred = 2
> + * node_demotion[1].preferred = 0
> + * node_demotion[2].preferred = <empty>
> + *
> + */
> +static struct demotion_nodes *node_demotion __read_mostly;
> +#endif /* CONFIG_MIGRATION */
> +
>  /*
>   * For now let's have 4 memory tier below default DRAM tier.
>   */
> @@ -76,6 +147,136 @@ static struct memory_tier *__node_get_memory_tier(int node)
>  	return NULL;
>  }
>  
> +#ifdef CONFIG_MIGRATION
> +/**
> + * next_demotion_node() - Get the next node in the demotion path
> + * @node: The starting node to lookup the next node
> + *
> + * Return: node id for next memory node in the demotion path hierarchy
> + * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
> + * @node online or guarantee that it *continues* to be the next demotion
> + * target.
> + */
> +int next_demotion_node(int node)
> +{
> +	struct demotion_nodes *nd;
> +	int target;
> +
> +	if (!node_demotion)
> +		return NUMA_NO_NODE;
> +
> +	nd = &node_demotion[node];
> +
> +	/*
> +	 * node_demotion[] is updated without excluding this
> +	 * function from running.
> +	 *
> +	 * Make sure to use RCU over entire code blocks if
> +	 * node_demotion[] reads need to be consistent.
> +	 */
> +	rcu_read_lock();
> +	/*
> +	 * If there are multiple target nodes, just select one
> +	 * target node randomly.
> +	 *
> +	 * In addition, we can also use round-robin to select
> +	 * target node, but we should introduce another variable
> +	 * for node_demotion[] to record last selected target node,
> +	 * that may cause cache ping-pong due to the changing of
> +	 * last target node. Or introducing per-cpu data to avoid
> +	 * caching issue, which seems more complicated. So selecting
> +	 * target node randomly seems better until now.
> +	 */
> +	target = node_random(&nd->preferred);

In one of the most common cases, nodes_weight(nd->preferred) == 1.
In that case, get_random_int() in node_random() just wastes CPU cycles
and random entropy.  So the original struct demotion_nodes
implementation appears better.

  struct demotion_nodes {
         unsigned short nr;
         short nodes[DEMOTION_TARGET_NODES];
  };
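
A minimal userspace sketch (not kernel code) of why the nr-based layout
helps: with a count at hand, the single-target case never touches the
random number generator.  fake_random_int(), rng_calls and pick_target()
are hypothetical names, rand() is only a placeholder for
get_random_int(), and the READ_ONCE()/RCU handling of the real function
is omitted.

```c
#include <stdlib.h>

#define DEMOTION_TARGET_NODES 15
#define NUMA_NO_NODE (-1)

struct demotion_nodes {
	unsigned short nr;
	short nodes[DEMOTION_TARGET_NODES];
};

static unsigned long rng_calls;	/* counts how often randomness is consumed */

/* Stand-in for get_random_int(); rand() is only a userspace placeholder. */
static unsigned int fake_random_int(void)
{
	rng_calls++;
	return (unsigned int)rand();
}

/* Pick a demotion target; only touch the RNG when there is a real choice. */
static int pick_target(const struct demotion_nodes *nd)
{
	switch (nd->nr) {
	case 0:
		return NUMA_NO_NODE;
	case 1:
		return nd->nodes[0];
	default:
		return nd->nodes[fake_random_int() % nd->nr];
	}
}
```

The same fast path could presumably be added on top of a nodemask by
checking its weight first, at the cost of an extra bitmap scan.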

> +	rcu_read_unlock();
> +
> +	return target;
> +}
> +
> +/* Disable reclaim-based migration. */
> +static void __disable_all_migrate_targets(void)

How about renaming "migrate" to "demote" to make it more specific?

> +{
> +	int node;
> +
> +	for_each_node_state(node, N_MEMORY)
> +		node_demotion[node].preferred = NODE_MASK_NONE;
> +}
> +
> +static void disable_all_migrate_targets(void)
> +{
> +	__disable_all_migrate_targets();
> +
> +	/*
> +	 * Ensure that the "disable" is visible across the system.
> +	 * Readers will see either a combination of before+disable
> +	 * state or disable+after.  They will never see before and
> +	 * after state together.
> +	 */
> +	synchronize_rcu();
> +}
> +/*
> + * Find an automatic demotion target for all memory
> + * nodes. Failing here is OK.  It might just indicate
> + * being at the end of a chain.
> + */
> +static void establish_migration_targets(void)
> +{
> +	struct memory_tier *memtier;
> +	struct demotion_nodes *nd;
> +	int target = NUMA_NO_NODE, node;
> +	int distance, best_distance;
> +	nodemask_t used;
> +
> +	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
> +		return;
> +
> +	disable_all_migrate_targets();
> +
> +	for_each_node_state(node, N_MEMORY) {
> +		best_distance = -1;
> +		nd = &node_demotion[node];
> +
> +		memtier = __node_get_memory_tier(node);
> +		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
> +			continue;
> +		/*
> +		 * Get the next memtier to find the  demotion node list.
> +		 */
> +		memtier = list_next_entry(memtier, list);
> +
> +		/*
> +		 * find_next_best_node, use 'used' nodemask as a skip list.
> +		 * Add all memory nodes except the selected memory tier
> +		 * nodelist to skip list so that we find the best node from the
> +		 * memtier nodelist.
> +		 */
> +		nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);
> +
> +		/*
> +		 * Find all the nodes in the memory tier node list of same best distance.
> +		 * add them to the preferred mask. We randomly select between nodes
> +		 * in the preferred mask when allocating pages during demotion.
> +		 */
> +		do {
> +			target = find_next_best_node(node, &used);
> +			if (target == NUMA_NO_NODE)
> +				break;
> +
> +			distance = node_distance(node, target);
> +			if (distance == best_distance || best_distance == -1) {
> +				best_distance = distance;
> +				node_set(target, nd->preferred);
> +			} else {
> +				break;
> +			}
> +		} while (1);
> +	}
> +}
> +#else
> +static inline void disable_all_migrate_targets(void) {}
> +static inline void establish_migration_targets(void) {}
> +#endif /* CONFIG_MIGRATION */
> +
>  static void init_node_memory_tier(int node)
>  {
>  	int perf_level;
> @@ -84,11 +285,19 @@ static void init_node_memory_tier(int node)
>  	mutex_lock(&memory_tier_lock);
>  
>  	memtier = __node_get_memory_tier(node);
> +	/*
> +	 * if node is already part of the tier proceed with the
> +	 * current tier value, because we might want to establish
> +	 * new migration paths now. The node might be added to a tier
> +	 * before it was made part of N_MEMORY, hence establish_migration_targets()
> +	 * will have skipped this node.
> +	 */
>  	if (!memtier) {
>  		perf_level = node_devices[node]->perf_level;
>  		memtier = find_create_memory_tier(perf_level);
>  		node_set(node, memtier->nodelist);
>  	}
> +	establish_migration_targets();

Why combine memory tier establishment with demotion target building?
I think that it's better to separate them.  For example, if we move a
set of NUMA nodes from one memory tier to another, we only need to run
establish_migration_targets() once after moving all nodes.
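
A minimal userspace sketch (not kernel code) of the separation I have
in mind.  All names here (node_tier[], set_node_tier(),
rebuild_demotion_targets(), move_nodes_to_tier()) are hypothetical, and
the rebuild function only counts calls instead of doing real work.

```c
#include <stddef.h>

#define NR_NODES 8

static int node_tier[NR_NODES];	/* tier index per node, hypothetical layout */
static unsigned int rebuilds;	/* how many times targets were recomputed */

/* Stand-in for establish_migration_targets(); only counts invocations here. */
static void rebuild_demotion_targets(void)
{
	rebuilds++;
}

/* Tier bookkeeping only: no demotion-target work in here. */
static void set_node_tier(int node, int tier)
{
	node_tier[node] = tier;
}

/* Move a set of nodes to @tier, then rebuild the targets exactly once. */
static void move_nodes_to_tier(const int *nodes, size_t count, int tier)
{
	size_t i;

	for (i = 0; i < count; i++)
		set_node_tier(nodes[i], tier);
	rebuild_demotion_targets();
}
```

Moving three nodes this way costs one rebuild (and one
synchronize_rcu() in the real code) instead of three.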

>  	mutex_unlock(&memory_tier_lock);
>  }
>  
> @@ -98,8 +307,10 @@ static void clear_node_memory_tier(int node)
>  
>  	mutex_lock(&memory_tier_lock);
>  	memtier = __node_get_memory_tier(node);
> -	if (memtier)
> +	if (memtier) {
>  		node_clear(node, memtier->nodelist);
> +		establish_migration_targets();
> +	}
>  	mutex_unlock(&memory_tier_lock);
>  }
>  
> @@ -134,6 +345,11 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>  
>  static void __init migrate_on_reclaim_init(void)
>  {
> +	if (IS_ENABLED(CONFIG_MIGRATION)) {
> +		node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),

Why allocate MAX_NUMNODES instead of nr_node_ids as before?

> +					GFP_KERNEL);
> +		WARN_ON(!node_demotion);
> +	}
>  	hotplug_memory_notifier(migrate_on_reclaim_callback, MEMTIER_HOTPLUG_PRIO);
>  }
>  
> diff --git a/mm/migrate.c b/mm/migrate.c
> index fce7d4a9e940..c758c9c21d7d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2117,398 +2117,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>  	return 0;
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
> -
> -/*
> - * node_demotion[] example:
> - *
> - * Consider a system with two sockets.  Each socket has
> - * three classes of memory attached: fast, medium and slow.
> - * Each memory class is placed in its own NUMA node.  The
> - * CPUs are placed in the node with the "fast" memory.  The
> - * 6 NUMA nodes (0-5) might be split among the sockets like
> - * this:
> - *
> - *	Socket A: 0, 1, 2
> - *	Socket B: 3, 4, 5
> - *
> - * When Node 0 fills up, its memory should be migrated to
> - * Node 1.  When Node 1 fills up, it should be migrated to
> - * Node 2.  The migration path start on the nodes with the
> - * processors (since allocations default to this node) and
> - * fast memory, progress through medium and end with the
> - * slow memory:
> - *
> - *	0 -> 1 -> 2 -> stop
> - *	3 -> 4 -> 5 -> stop
> - *
> - * This is represented in the node_demotion[] like this:
> - *
> - *	{  nr=1, nodes[0]=1 }, // Node 0 migrates to 1
> - *	{  nr=1, nodes[0]=2 }, // Node 1 migrates to 2
> - *	{  nr=0, nodes[0]=-1 }, // Node 2 does not migrate
> - *	{  nr=1, nodes[0]=4 }, // Node 3 migrates to 4
> - *	{  nr=1, nodes[0]=5 }, // Node 4 migrates to 5
> - *	{  nr=0, nodes[0]=-1 }, // Node 5 does not migrate
> - *
> - * Moreover some systems may have multiple slow memory nodes.
> - * Suppose a system has one socket with 3 memory nodes, node 0
> - * is fast memory type, and node 1/2 both are slow memory
> - * type, and the distance between fast memory node and slow
> - * memory node is same. So the migration path should be:
> - *
> - *	0 -> 1/2 -> stop
> - *
> - * This is represented in the node_demotion[] like this:
> - *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
> - *	{ nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
> - *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
> - */
> -
> -/*
> - * Writes to this array occur without locking.  Cycles are
> - * not allowed: Node X demotes to Y which demotes to X...
> - *
> - * If multiple reads are performed, a single rcu_read_lock()
> - * must be held over all reads to ensure that no cycles are
> - * observed.
> - */
> -#define DEFAULT_DEMOTION_TARGET_NODES 15
> -
> -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
> -#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
> -#else
> -#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
> -#endif
> -
> -struct demotion_nodes {
> -	unsigned short nr;
> -	short nodes[DEMOTION_TARGET_NODES];
> -};
> -
> -static struct demotion_nodes *node_demotion __read_mostly;
> -
> -/**
> - * next_demotion_node() - Get the next node in the demotion path
> - * @node: The starting node to lookup the next node
> - *
> - * Return: node id for next memory node in the demotion path hierarchy
> - * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
> - * @node online or guarantee that it *continues* to be the next demotion
> - * target.
> - */
> -int next_demotion_node(int node)
> -{
> -	struct demotion_nodes *nd;
> -	unsigned short target_nr, index;
> -	int target;
> -
> -	if (!node_demotion)
> -		return NUMA_NO_NODE;
> -
> -	nd = &node_demotion[node];
> -
> -	/*
> -	 * node_demotion[] is updated without excluding this
> -	 * function from running.  RCU doesn't provide any
> -	 * compiler barriers, so the READ_ONCE() is required
> -	 * to avoid compiler reordering or read merging.
> -	 *
> -	 * Make sure to use RCU over entire code blocks if
> -	 * node_demotion[] reads need to be consistent.
> -	 */
> -	rcu_read_lock();
> -	target_nr = READ_ONCE(nd->nr);
> -
> -	switch (target_nr) {
> -	case 0:
> -		target = NUMA_NO_NODE;
> -		goto out;
> -	case 1:
> -		index = 0;
> -		break;
> -	default:
> -		/*
> -		 * If there are multiple target nodes, just select one
> -		 * target node randomly.
> -		 *
> -		 * In addition, we can also use round-robin to select
> -		 * target node, but we should introduce another variable
> -		 * for node_demotion[] to record last selected target node,
> -		 * that may cause cache ping-pong due to the changing of
> -		 * last target node. Or introducing per-cpu data to avoid
> -		 * caching issue, which seems more complicated. So selecting
> -		 * target node randomly seems better until now.
> -		 */
> -		index = get_random_int() % target_nr;
> -		break;
> -	}
> -
> -	target = READ_ONCE(nd->nodes[index]);
> -
> -out:
> -	rcu_read_unlock();
> -	return target;
> -}
> -
> -/* Disable reclaim-based migration. */
> -static void __disable_all_migrate_targets(void)
> -{
> -	int node, i;
> -
> -	if (!node_demotion)
> -		return;
> -
> -	for_each_online_node(node) {
> -		node_demotion[node].nr = 0;
> -		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
> -			node_demotion[node].nodes[i] = NUMA_NO_NODE;
> -	}
> -}
> -
> -static void disable_all_migrate_targets(void)
> -{
> -	__disable_all_migrate_targets();
> -
> -	/*
> -	 * Ensure that the "disable" is visible across the system.
> -	 * Readers will see either a combination of before+disable
> -	 * state or disable+after.  They will never see before and
> -	 * after state together.
> -	 *
> -	 * The before+after state together might have cycles and
> -	 * could cause readers to do things like loop until this
> -	 * function finishes.  This ensures they can only see a
> -	 * single "bad" read and would, for instance, only loop
> -	 * once.
> -	 */
> -	synchronize_rcu();
> -}
> -
> -/*
> - * Find an automatic demotion target for 'node'.
> - * Failing here is OK.  It might just indicate
> - * being at the end of a chain.
> - */
> -static int establish_migrate_target(int node, nodemask_t *used,
> -				    int best_distance)
> -{
> -	int migration_target, index, val;
> -	struct demotion_nodes *nd;
> -
> -	if (!node_demotion)
> -		return NUMA_NO_NODE;
> -
> -	nd = &node_demotion[node];
> -
> -	migration_target = find_next_best_node(node, used);
> -	if (migration_target == NUMA_NO_NODE)
> -		return NUMA_NO_NODE;
> -
> -	/*
> -	 * If the node has been set a migration target node before,
> -	 * which means it's the best distance between them. Still
> -	 * check if this node can be demoted to other target nodes
> -	 * if they have a same best distance.
> -	 */
> -	if (best_distance != -1) {
> -		val = node_distance(node, migration_target);
> -		if (val > best_distance)
> -			goto out_clear;
> -	}
> -
> -	index = nd->nr;
> -	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
> -		      "Exceeds maximum demotion target nodes\n"))
> -		goto out_clear;
> -
> -	nd->nodes[index] = migration_target;
> -	nd->nr++;
> -
> -	return migration_target;
> -out_clear:
> -	node_clear(migration_target, *used);
> -	return NUMA_NO_NODE;
> -}
> -
> -/*
> - * When memory fills up on a node, memory contents can be
> - * automatically migrated to another node instead of
> - * discarded at reclaim.
> - *
> - * Establish a "migration path" which will start at nodes
> - * with CPUs and will follow the priorities used to build the
> - * page allocator zonelists.
> - *
> - * The difference here is that cycles must be avoided.  If
> - * node0 migrates to node1, then neither node1, nor anything
> - * node1 migrates to can migrate to node0. Also one node can
> - * be migrated to multiple nodes if the target nodes all have
> - * a same best-distance against the source node.
> - *
> - * This function can run simultaneously with readers of
> - * node_demotion[].  However, it can not run simultaneously
> - * with itself.  Exclusion is provided by memory hotplug events
> - * being single-threaded.
> - */
> -static void __set_migration_target_nodes(void)
> -{
> -	nodemask_t next_pass;
> -	nodemask_t this_pass;
> -	nodemask_t used_targets = NODE_MASK_NONE;
> -	int node, best_distance;
> -
> -	/*
> -	 * Avoid any oddities like cycles that could occur
> -	 * from changes in the topology.  This will leave
> -	 * a momentary gap when migration is disabled.
> -	 */
> -	disable_all_migrate_targets();
> -
> -	/*
> -	 * Allocations go close to CPUs, first.  Assume that
> -	 * the migration path starts at the nodes with CPUs.
> -	 */
> -	next_pass = node_states[N_CPU];
> -again:
> -	this_pass = next_pass;
> -	next_pass = NODE_MASK_NONE;
> -	/*
> -	 * To avoid cycles in the migration "graph", ensure
> -	 * that migration sources are not future targets by
> -	 * setting them in 'used_targets'.  Do this only
> -	 * once per pass so that multiple source nodes can
> -	 * share a target node.
> -	 *
> -	 * 'used_targets' will become unavailable in future
> -	 * passes.  This limits some opportunities for
> -	 * multiple source nodes to share a destination.
> -	 */
> -	nodes_or(used_targets, used_targets, this_pass);
> -
> -	for_each_node_mask(node, this_pass) {
> -		best_distance = -1;
> -
> -		/*
> -		 * Try to set up the migration path for the node, and the target
> -		 * migration nodes can be multiple, so doing a loop to find all
> -		 * the target nodes if they all have a best node distance.
> -		 */
> -		do {
> -			int target_node =
> -				establish_migrate_target(node, &used_targets,
> -							 best_distance);
> -
> -			if (target_node == NUMA_NO_NODE)
> -				break;
> -
> -			if (best_distance == -1)
> -				best_distance = node_distance(node, target_node);
> -
> -			/*
> -			 * Visit targets from this pass in the next pass.
> -			 * Eventually, every node will have been part of
> -			 * a pass, and will become set in 'used_targets'.
> -			 */
> -			node_set(target_node, next_pass);
> -		} while (1);
> -	}
> -	/*
> -	 * 'next_pass' contains nodes which became migration
> -	 * targets in this pass.  Make additional passes until
> -	 * no more migrations targets are available.
> -	 */
> -	if (!nodes_empty(next_pass))
> -		goto again;
> -}
> -
> -/*
> - * For callers that do not hold get_online_mems() already.
> - */
> -void set_migration_target_nodes(void)
> -{
> -	get_online_mems();
> -	__set_migration_target_nodes();
> -	put_online_mems();
> -}
> -
> -/*
> - * This leaves migrate-on-reclaim transiently disabled between
> - * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
> - * whether reclaim-based migration is enabled or not, which
> - * ensures that the user can turn reclaim-based migration at
> - * any time without needing to recalculate migration targets.
> - *
> - * These callbacks already hold get_online_mems().  That is why
> - * __set_migration_target_nodes() can be used as opposed to
> - * set_migration_target_nodes().
> - */
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> -						 unsigned long action, void *_arg)
> -{
> -	struct memory_notify *arg = _arg;
> -
> -	/*
> -	 * Only update the node migration order when a node is
> -	 * changing status, like online->offline.  This avoids
> -	 * the overhead of synchronize_rcu() in most cases.
> -	 */
> -	if (arg->status_change_nid < 0)
> -		return notifier_from_errno(0);
> -
> -	switch (action) {
> -	case MEM_GOING_OFFLINE:
> -		/*
> -		 * Make sure there are not transient states where
> -		 * an offline node is a migration target.  This
> -		 * will leave migration disabled until the offline
> -		 * completes and the MEM_OFFLINE case below runs.
> -		 */
> -		disable_all_migrate_targets();
> -		break;
> -	case MEM_OFFLINE:
> -	case MEM_ONLINE:
> -		/*
> -		 * Recalculate the target nodes once the node
> -		 * reaches its final state (online or offline).
> -		 */
> -		__set_migration_target_nodes();
> -		break;
> -	case MEM_CANCEL_OFFLINE:
> -		/*
> -		 * MEM_GOING_OFFLINE disabled all the migration
> -		 * targets.  Reenable them.
> -		 */
> -		__set_migration_target_nodes();
> -		break;
> -	case MEM_GOING_ONLINE:
> -	case MEM_CANCEL_ONLINE:
> -		break;
> -	}
> -
> -	return notifier_from_errno(0);
> -}
> -#endif
> -
> -void __init migrate_on_reclaim_init(void)
> -{
> -	node_demotion = kcalloc(nr_node_ids,
> -				sizeof(struct demotion_nodes),
> -				GFP_KERNEL);
> -	WARN_ON(!node_demotion);
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
> -#endif
> -	/*
> -	 * At this point, all numa nodes with memory/CPus have their state
> -	 * properly set, so we can build the demotion order now.
> -	 * Let us hold the cpu_hotplug lock just, as we could possibily have
> -	 * CPU hotplug events during boot.
> -	 */
> -	cpus_read_lock();
> -	set_migration_target_nodes();
> -	cpus_read_unlock();
> -}
>  #endif /* CONFIG_NUMA */
> -
> -
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 373d2730fcf2..35c6ff97cf29 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -28,7 +28,6 @@
>  #include <linux/mm_inline.h>
>  #include <linux/page_ext.h>
>  #include <linux/page_owner.h>
> -#include <linux/migrate.h>
>  
>  #include "internal.h"
>  
> @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu)
>  
>  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
>  		node_set_state(cpu_to_node(cpu), N_CPU);
> -		set_migration_target_nodes();
>  	}

"{" and "}" can be removed too?

Best Regards,
Huang, Ying

>  
>  	return 0;
> @@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu)
>  		return 0;
>  
>  	node_clear_state(node, N_CPU);
> -	set_migration_target_nodes();
>  
>  	return 0;
>  }
> @@ -2118,7 +2115,6 @@ void __init init_mm_internals(void)
>  
>  	start_shepherd_timer();
>  #endif
> -	migrate_on_reclaim_init();
>  #ifdef CONFIG_PROC_FS
>  	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
>  	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);

* Re: [PATCH v10 6/8] mm/demotion: Add pg_data_t member to track node memory tier details
  2022-07-20  2:59 ` [PATCH v10 6/8] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
@ 2022-07-26  8:02   ` Huang, Ying
  0 siblings, 0 replies; 39+ messages in thread
From: Huang, Ying @ 2022-07-26  8:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> Also update different helpers to use NODE_DATA()->memtier. Since
> node specific memtier can change based on the reassignment of
> NUMA node to a different memory tier, accessing NODE_DATA()->memtier
> needs to happen under an rcu read lock or memory_tier_lock.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/mmzone.h |  3 ++
>  mm/memory-tiers.c      | 65 +++++++++++++++++++++++++++++++++++-------
>  2 files changed, 57 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index aab70355d64f..353812495a70 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -928,6 +928,9 @@ typedef struct pglist_data {
>  	/* Per-node vmstats */
>  	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
>  	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
> +#ifdef CONFIG_NUMA
> +	struct memory_tier __rcu *memtier;
> +#endif
>  } pg_data_t;
>  
>  #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index a8cfe2ca3903..4715f9b96a44 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -138,13 +138,18 @@ static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
>  
>  static struct memory_tier *__node_get_memory_tier(int node)
>  {
> -	struct memory_tier *memtier;
> +	pg_data_t *pgdat;
>  
> -	list_for_each_entry(memtier, &memory_tiers, list) {
> -		if (node_isset(node, memtier->nodelist))
> -			return memtier;
> -	}
> -	return NULL;
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return NULL;
> +	/*
> +	 * Since we hold memory_tier_lock, we can avoid
> +	 * RCU read locks when accessing the details. No
> +	 * parallel updates are possible here.
> +	 */
> +	return rcu_dereference_check(pgdat->memtier,
> +				     lockdep_is_held(&memory_tier_lock));
>  }
>  
>  #ifdef CONFIG_MIGRATION
> @@ -277,6 +282,29 @@ static inline void disable_all_migrate_targets(void) {}
>  static inline void establish_migration_targets(void) {}
>  #endif /* CONFIG_MIGRATION */
>  
> +static void memtier_node_set(int node, struct memory_tier *memtier)
> +{
> +	pg_data_t *pgdat;
> +	struct memory_tier *current_memtier;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
> +	/*
> +	 * Make sure we mark the memtier NULL before we assign the new memory tier
> +	 * to the NUMA node. This make sure that anybody looking at NODE_DATA
> +	 * finds a NULL memtier or the one which is still valid.
> +	 */
> +	current_memtier = rcu_dereference_check(pgdat->memtier,
> +						lockdep_is_held(&memory_tier_lock));
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	synchronize_rcu();
> +	if (current_memtier)
> +		node_clear(node, current_memtier->nodelist);

If pgdat->memtier == NULL, we don't need to set it to NULL and call
synchronize_rcu().  That is,

+	current_memtier = rcu_dereference_check(pgdat->memtier,
+						lockdep_is_held(&memory_tier_lock));
+	if (current_memtier) {
+		rcu_assign_pointer(pgdat->memtier, NULL);
+		synchronize_rcu();
+		node_clear(node, current_memtier->nodelist);
+	}

Same for clear_node_memory_tier().
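
The effect can be modelled in userspace (this is not kernel code and
uses no real RCU): the sketch below only counts how often a grace
period would be waited for.  node_memtier[], fake_synchronize_rcu(),
grace_periods and detach_node() are hypothetical names.

```c
#include <stddef.h>

struct tier {
	unsigned long nodelist;
};

static struct tier *node_memtier[4];	/* stand-in for NODE_DATA(nid)->memtier */
static unsigned int grace_periods;	/* counts simulated synchronize_rcu() calls */

/* Placeholder: a real grace period blocks; here we only count calls. */
static void fake_synchronize_rcu(void)
{
	grace_periods++;
}

/* Detach @node from its current tier, waiting only if there was one. */
static void detach_node(int node)
{
	struct tier *cur = node_memtier[node];

	if (!cur)
		return;
	node_memtier[node] = NULL;
	fake_synchronize_rcu();
	cur->nodelist &= ~(1UL << node);
}
```

A node that was never assigned to a tier, or that has already been
detached, then costs no grace period at all.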

Best Regards,
Huang, Ying

> +	node_set(node, memtier->nodelist);
> +	rcu_assign_pointer(pgdat->memtier, memtier);
> +}
> +
>  static void init_node_memory_tier(int node)
>  {
>  	int perf_level;
> @@ -295,7 +323,7 @@ static void init_node_memory_tier(int node)
>  	if (!memtier) {
>  		perf_level = node_devices[node]->perf_level;
>  		memtier = find_create_memory_tier(perf_level);
> -		node_set(node, memtier->nodelist);
> +		memtier_node_set(node, memtier);
>  	}
>  	establish_migration_targets();
>  	mutex_unlock(&memory_tier_lock);
> @@ -303,12 +331,25 @@ static void init_node_memory_tier(int node)
>  
>  static void clear_node_memory_tier(int node)
>  {
> -	struct memory_tier *memtier;
> +	pg_data_t *pgdat;
> +	struct memory_tier *current_memtier;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
>  
>  	mutex_lock(&memory_tier_lock);
> -	memtier = __node_get_memory_tier(node);
> -	if (memtier) {
> -		node_clear(node, memtier->nodelist);
> +	/*
> +	 * Make sure we mark the memtier NULL before we assign the new memory tier
> +	 * to the NUMA node. This make sure that anybody looking at NODE_DATA
> +	 * finds a NULL memtier or the one which is still valid.
> +	 */
> +	current_memtier = rcu_dereference_check(pgdat->memtier,
> +						lockdep_is_held(&memory_tier_lock));
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	synchronize_rcu();
> +	if (current_memtier) {
> +		node_clear(node, current_memtier->nodelist);
>  		establish_migration_targets();
>  	}
>  	mutex_unlock(&memory_tier_lock);
> @@ -383,6 +424,8 @@ static int __init memory_tier_init(void)
>  
>  		if (!node_property->perf_level)
>  			node_property->perf_level = default_memtier_perf_level;
> +
> +		rcu_assign_pointer(NODE_DATA(node)->memtier, memtier);
>  	}
>  	mutex_unlock(&memory_tier_lock);


* Re: [PATCH v10 7/8] mm/demotion: Demote pages according to allocation fallback order
  2022-07-20  2:59 ` [PATCH v10 7/8] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
@ 2022-07-26  8:24   ` Huang, Ying
  0 siblings, 0 replies; 39+ messages in thread
From: Huang, Ying @ 2022-07-26  8:24 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> From: Jagdish Gediya <jvgediya.oss@gmail.com>
>
> Currently, a higher tier node can only be demoted to selected
> nodes on the next lower tier as defined by the demotion path.
> This strict, hard-coded demotion order does not work in all
> use cases (e.g. some use cases may want to allow cross-socket
> demotion to another node in the same demotion tier as a fallback
> when the preferred demotion node is out of space). This demotion
> order is also inconsistent with the page allocation fallback order
> when all the nodes in a higher tier are out of space: The page
> allocation can fall back to any node from any lower tier, whereas
> the demotion order doesn't allow that currently.
>
> This patch adds support to get all the allowed demotion targets
> for a memory tier. demote_page_list() function is now modified
> to utilize this allowed node mask as the fallback allocation mask.
>
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h | 11 +++++++
>  mm/memory-tiers.c            | 54 +++++++++++++++++++++++++++++++--
>  mm/vmscan.c                  | 58 ++++++++++++++++++++++++++----------
>  3 files changed, 106 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 852e86bd0a23..0e58588fa066 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -19,11 +19,17 @@
>  extern bool numa_demotion_enabled;
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
> +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>  #else
>  static inline int next_demotion_node(int node)
>  {
>  	return NUMA_NO_NODE;
>  }
> +
> +static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
> +{
> +	*targets = NODE_MASK_NONE;
> +}
>  #endif
>  
>  #else
> @@ -33,5 +39,10 @@ static inline int next_demotion_node(int node)
>  {
>  	return NUMA_NO_NODE;
>  }
> +
> +static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
> +{
> +	*targets = NODE_MASK_NONE;
> +}
>  #endif	/* CONFIG_NUMA */
>  #endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 4715f9b96a44..4a96e4213d66 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -15,6 +15,7 @@ struct memory_tier {
>  	struct list_head list;
>  	int perf_level;
>  	nodemask_t nodelist;
> +	nodemask_t lower_tier_mask;
>  };
>  
>  struct demotion_nodes {
> @@ -153,6 +154,24 @@ static struct memory_tier *__node_get_memory_tier(int node)
>  }
>  
>  #ifdef CONFIG_MIGRATION
> +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
> +{
> +	struct memory_tier *memtier;
> +
> +	/*
> +	 * pg_data_t.memtier updates includes a synchronize_rcu()
> +	 * which ensures that we either find NULL or a valid memtier
> +	 * in NODE_DATA. protect the access via rcu_read_lock();
> +	 */
> +	rcu_read_lock();
> +	memtier = rcu_dereference(pgdat->memtier);
> +	if (memtier)
> +		*targets = memtier->lower_tier_mask;
> +	else
> +		*targets = NODE_MASK_NONE;
> +	rcu_read_unlock();
> +}
> +
>  /**
>   * next_demotion_node() - Get the next node in the demotion path
>   * @node: The starting node to lookup the next node
> @@ -201,10 +220,19 @@ int next_demotion_node(int node)
>  /* Disable reclaim-based migration. */
>  static void __disable_all_migrate_targets(void)
>  {
> +	struct memory_tier *memtier;
>  	int node;
>  
> -	for_each_node_state(node, N_MEMORY)
> +	for_each_node_state(node, N_MEMORY) {
>  		node_demotion[node].preferred = NODE_MASK_NONE;
> +		/*
> +		 * We are holding memory_tier_lock, it is safe
> +		 * to access pgda->memtier.
> +		 */
> +		memtier = rcu_dereference_check(NODE_DATA(node)->memtier,
> +						lockdep_is_held(&memory_tier_lock));
> +		memtier->lower_tier_mask = NODE_MASK_NONE;
> +	}
>  }
>  
>  static void disable_all_migrate_targets(void)
> @@ -230,7 +258,7 @@ static void establish_migration_targets(void)
>  	struct demotion_nodes *nd;
>  	int target = NUMA_NO_NODE, node;
>  	int distance, best_distance;
> -	nodemask_t used;
> +	nodemask_t used, lower_tier = NODE_MASK_NONE;
>  
>  	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
>  		return;
> @@ -276,6 +304,28 @@ static void establish_migration_targets(void)
>  			}
>  		} while (1);
>  	}
> +	/*
> +	 * Now build the lower_tier mask for each node collecting node mask from
> +	 * all memory tier below it. This allows us to fallback demotion page
> +	 * allocation to a set of nodes that is closer the above selected
> +	 * perferred node.
> +	 */
> +	list_for_each_entry(memtier, &memory_tiers, list)
> +		nodes_or(lower_tier, lower_tier, memtier->nodelist);
> +	/*
> +	 * Removes nodes not yet in N_MEMORY.
> +	 */
> +	nodes_and(lower_tier, node_states[N_MEMORY], lower_tier);

Is the above code equivalent to

        lower_tier = node_states[N_MEMORY];

?

> +
> +	list_for_each_entry(memtier, &memory_tiers, list) {
> +		/*
> +		 * Keep removing current tier from lower_tier nodes,
> +		 * This will remove all nodes in current and above
> +		 * memory tier from the lower_tier mask.
> +		 */
> +		nodes_andnot(lower_tier, lower_tier, memtier->nodelist);
> +		memtier->lower_tier_mask = lower_tier;
> +	}

This is per-memtier instead of per-node, so do we need to run this code
for each node?  That is, can we move the above code out of the
for_each_node() loop?
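
For reference, the mask computation can be sketched in userspace C as
below.  This is only an illustration: struct tier and
build_lower_tier_masks() are hypothetical names, node masks are plain
64-bit sets, and the tiers are assumed to be ordered from the highest
tier to the lowest.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct memory_tier. */
struct tier {
	uint64_t nodelist;		/* nodes that belong to this tier */
	uint64_t lower_tier_mask;	/* nodes in all tiers below this one */
};

/*
 * tiers[] is ordered from the highest tier to the lowest; n_memory is
 * the set of nodes that currently have memory.
 */
static void build_lower_tier_masks(struct tier *tiers, size_t nr,
				   uint64_t n_memory)
{
	uint64_t lower = 0;
	size_t i;

	/* Collect the nodes of every tier, then drop nodes without memory. */
	for (i = 0; i < nr; i++)
		lower |= tiers[i].nodelist;
	lower &= n_memory;

	/* One pass over the tiers: remove the current tier and all above it. */
	for (i = 0; i < nr; i++) {
		lower &= ~tiers[i].nodelist;
		tiers[i].lower_tier_mask = lower;
	}
}
```

The whole computation is a single walk over the tier list and does not
depend on any particular node.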

Best Regards,
Huang, Ying

>  }
>  #else
>  static inline void disable_all_migrate_targets(void) {}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3a8f78277f99..60a5235dd639 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1460,21 +1460,34 @@ static void folio_check_dirty_writeback(struct folio *folio,
>  		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
>  }
>  
> -static struct page *alloc_demote_page(struct page *page, unsigned long node)
> +static struct page *alloc_demote_page(struct page *page, unsigned long private)
>  {
> -	struct migration_target_control mtc = {
> -		/*
> -		 * Allocate from 'node', or fail quickly and quietly.
> -		 * When this happens, 'page' will likely just be discarded
> -		 * instead of migrated.
> -		 */
> -		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> -			    __GFP_THISNODE  | __GFP_NOWARN |
> -			    __GFP_NOMEMALLOC | GFP_NOWAIT,
> -		.nid = node
> -	};
> +	struct page *target_page;
> +	nodemask_t *allowed_mask;
> +	struct migration_target_control *mtc;
> +
> +	mtc = (struct migration_target_control *)private;
> +
> +	allowed_mask = mtc->nmask;
> +	/*
> +	 * make sure we allocate from the target node first also trying to
> +	 * reclaim pages from the target node via kswapd if we are low on
           ~~~~~~~

demote or reclaim

> +	 * free memory on target node. If we don't do this and if we have low
                                                           ~~~~~~~~~~~~~~~~~~
> +	 * free memory on the target memtier, we would start allocating pages
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

and if we have free memory on the slower(lower) memtier,

> +	 * from higher memory tiers without even forcing a demotion of cold
                ~~~~~~

slower(lower)

> +	 * pages from the target memtier. This can result in the kernel placing
                                 ~~~~~~~

node

> +	 * hotpages in higher memory tiers.
           ~~~~~~~~    ~~~~~~

hot pages

slower(lower)

Best Regards,
Huang, Ying

> +	 */
> +	mtc->nmask = NULL;
> +	mtc->gfp_mask |= __GFP_THISNODE;
> +	target_page = alloc_migration_target(page, (unsigned long)mtc);
> +	if (target_page)
> +		return target_page;
>  
> -	return alloc_migration_target(page, (unsigned long)&mtc);
> +	mtc->gfp_mask &= ~__GFP_THISNODE;
> +	mtc->nmask = allowed_mask;
> +
> +	return alloc_migration_target(page, (unsigned long)mtc);
>  }
>  
>  /*
> @@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
>  {
>  	int target_nid = next_demotion_node(pgdat->node_id);
>  	unsigned int nr_succeeded;
> +	nodemask_t allowed_mask;
> +
> +	struct migration_target_control mtc = {
> +		/*
> +		 * Allocate from 'node', or fail quickly and quietly.
> +		 * When this happens, 'page' will likely just be discarded
> +		 * instead of migrated.
> +		 */
> +		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> +			__GFP_NOMEMALLOC | GFP_NOWAIT,
> +		.nid = target_nid,
> +		.nmask = &allowed_mask
> +	};
>  
>  	if (list_empty(demote_pages))
>  		return 0;
> @@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
>  	if (target_nid == NUMA_NO_NODE)
>  		return 0;
>  
> +	node_get_allowed_targets(pgdat, &allowed_mask);
> +
>  	/* Demotion ignores all cpuset and mempolicy settings */
>  	migrate_pages(demote_pages, alloc_demote_page, NULL,
> -			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
> -			    &nr_succeeded);
> +		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
> +		      &nr_succeeded);
>  
>  	if (current_is_kswapd())
>  		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
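
The two-step allocation in alloc_demote_page() can be pictured with the
userspace sketch below.  It is a simplification under stated
assumptions: free_pages[], alloc_on_node() and alloc_demote_target()
are hypothetical, and the fallback simply walks the allowed mask from
the lowest node id, whereas the kernel follows the allocation fallback
order of the target node.

```c
#include <stdint.h>

#define NR_NODES 4
#define NO_NODE  (-1)

static int free_pages[NR_NODES];	/* hypothetical per-node free page counts */

/* Take one page from @node if it has any; return the node or NO_NODE. */
static int alloc_on_node(int node)
{
	if (node < 0 || node >= NR_NODES || free_pages[node] == 0)
		return NO_NODE;
	free_pages[node]--;
	return node;
}

/*
 * First try the preferred node alone (the __GFP_THISNODE step), then
 * fall back to any node in @allowed, lowest node id first.
 */
static int alloc_demote_target(int preferred, uint32_t allowed)
{
	int node = alloc_on_node(preferred);

	if (node != NO_NODE)
		return node;

	for (node = 0; node < NR_NODES; node++) {
		if ((allowed & (1u << node)) && alloc_on_node(node) != NO_NODE)
			return node;
	}
	return NO_NODE;
}
```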


* Re: [PATCH v10 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-26  3:53   ` Huang, Ying
@ 2022-07-26 11:59     ` Aneesh Kumar K V
  2022-07-27  1:16       ` Huang, Ying
  0 siblings, 1 reply; 39+ messages in thread
From: Aneesh Kumar K V @ 2022-07-26 11:59 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya


>> diff --git a/include/linux/node.h b/include/linux/node.h
>> index 40d641a8bfb0..a2a16d4104fd 100644
>> --- a/include/linux/node.h
>> +++ b/include/linux/node.h
>> @@ -92,6 +92,12 @@ struct node {
>>  	struct list_head cache_attrs;
>>  	struct device *cache_dev;
>>  #endif
>> +	/*
>> +	 * For memory devices, perf_level describes
>> +	 * the device performance and how it should be used
>> +	 * while building a memory hierarchy.
>> +	 */
>> +	int perf_level;
> 
> Think again, I found that "perf_level" may be not the best abstraction
> of the performance of memory devices.  In concept, it's an abstraction of the memory
> bandwidth.  But it will not reflect the memory latency.
> 
> Instead, the previous proposed "abstract_distance" is an abstraction of
> the memory latency.  Per my understanding, the memory latency has more
> direct influence on system performance.  And because the latency of the
> memory device will increase if the memory accessing throughput nears its
> max bandwidth, so the memory bandwidth can be reflected in the "abstract
> distance" too.  That is, the "abstract distance" is an abstraction of
> the memory latency under the expected memory accessing throughput.  The
> "offset" to the default "abstract distance" reflects the different
> expected memory accessing throughput.
> 
> So, I think we need some kind of abstraction of the memory latency
> instead of memory bandwidth, e.g., "abstract distance".
> 

I am reworking other parts of the patch set based on your feedback. For this
part, I guess we need to reach some consensus first.

IMHO perf_level (performance level) can indicate a combination of both latency
and bandwidth. It is an abstract concept that indicates the performance of the
device. As we learn more about which device attribute has more impact on
defining the hierarchy, the performance level can give more weight to that
specific attribute. It could be write latency or bandwidth. For me, distance has
a direct link to latency because that is how we define NUMA distance now. Adding
"abstract" to the name does not make it more abstract than perf_level.

I am open to suggestions from others.  Wei Xu has also suggested the perf_level
name. I can rename this to abstract_distance if that indicates the goal better.


-aneesh



* Re: [PATCH v10 3/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined
  2022-07-26  4:03   ` Huang, Ying
@ 2022-07-26 12:03     ` Aneesh Kumar K V
  2022-07-27  1:53       ` Huang, Ying
  0 siblings, 1 reply; 39+ messages in thread
From: Aneesh Kumar K V @ 2022-07-26 12:03 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

On 7/26/22 9:33 AM, Huang, Ying wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> 
>> If the new NUMA node onlined doesn't have a performance level assigned,
>> the kernel adds the NUMA node to default memory tier.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  include/linux/memory-tiers.h |  1 +
>>  mm/memory-tiers.c            | 75 ++++++++++++++++++++++++++++++++++++
>>  2 files changed, 76 insertions(+)
>>
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> index ef380a39db3a..3d5f14d57ae6 100644
>> --- a/include/linux/memory-tiers.h
>> +++ b/include/linux/memory-tiers.h
>> @@ -14,6 +14,7 @@
>>  #define MEMTIER_PERF_LEVEL_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
>>  /* leave one tier below this slow pmem */
>>  #define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)
>> +#define MEMTIER_HOTPLUG_PRIO	100
>>  
>>  extern bool numa_demotion_enabled;
>>  
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> index 41a21cc5ae55..cc3a47ec18e4 100644
>> --- a/mm/memory-tiers.c
>> +++ b/mm/memory-tiers.c
>> @@ -5,6 +5,7 @@
>>  #include <linux/lockdep.h>
>>  #include <linux/moduleparam.h>
>>  #include <linux/node.h>
>> +#include <linux/memory.h>
>>  #include <linux/memory-tiers.h>
>>  
>>  struct memory_tier {
>> @@ -64,6 +65,78 @@ static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
>>  	return new_memtier;
>>  }
>>  
>> +static struct memory_tier *__node_get_memory_tier(int node)
>> +{
>> +	struct memory_tier *memtier;
>> +
>> +	list_for_each_entry(memtier, &memory_tiers, list) {
>> +		if (node_isset(node, memtier->nodelist))
>> +			return memtier;
>> +	}
>> +	return NULL;
>> +}
>> +
>> +static void init_node_memory_tier(int node)
> 
> set_node_memory_tier()?

That was done based on feedback from Alistair 

https://lore.kernel.org/linux-mm/87h73iapg1.fsf@nvdebian.thelocal


> 
>> +{
>> +	int perf_level;
>> +	struct memory_tier *memtier;
>> +
>> +	mutex_lock(&memory_tier_lock);
>> +
>> +	memtier = __node_get_memory_tier(node);
>> +	if (!memtier) {
>> +		perf_level = node_devices[node]->perf_level;
>> +		memtier = find_create_memory_tier(perf_level);
>> +		node_set(node, memtier->nodelist);
>> +	}
>> +	mutex_unlock(&memory_tier_lock);
>> +}
>> +
>> +static void clear_node_memory_tier(int node)
>> +{
>> +	struct memory_tier *memtier;
>> +
>> +	mutex_lock(&memory_tier_lock);
>> +	memtier = __node_get_memory_tier(node);
>> +	if (memtier)
>> +		node_clear(node, memtier->nodelist);
> 
> When memtier->nodelist becomes empty, we need to free memtier?
> 
>> +	mutex_unlock(&memory_tier_lock);
>> +}
>> +
>> +/*
>> + * This runs whether reclaim-based migration is enabled or not,
>> + * which ensures that the user can turn reclaim-based migration
>> + * at any time without needing to recalculate migration targets.
>> + */
> 
> The comments doesn't apply here.
> 
>> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>> +						 unsigned long action, void *_arg)
> 
> Now we are building memory tiers instead of working on demotion.  So I
> think we should rename the function to memtier_hotplug_callback().
> 
>> +{
>> +	struct memory_notify *arg = _arg;
>> +
>> +	/*
>> +	 * Only update the node migration order when a node is
>> +	 * changing status, like online->offline.
>> +	 */
>> +	if (arg->status_change_nid < 0)
>> +		return notifier_from_errno(0);
>> +
>> +	switch (action) {
>> +	case MEM_OFFLINE:
>> +		clear_node_memory_tier(arg->status_change_nid);
>> +		break;
>> +	case MEM_ONLINE:
>> +		init_node_memory_tier(arg->status_change_nid);
>> +		break;
>> +	}
>> +
>> +	return notifier_from_errno(0);
>> +}
>> +
>> +static void __init migrate_on_reclaim_init(void)
>> +{
>> +	hotplug_memory_notifier(migrate_on_reclaim_callback, MEMTIER_HOTPLUG_PRIO);
>> +}
> 
> I suggest to call hotplug_memory_notifier() in memory_tier_init()
> directly.  We are not working on demotion here.
> 
>> +
>>  static int __init memory_tier_init(void)
>>  {
>>  	int node;
>> @@ -96,6 +169,8 @@ static int __init memory_tier_init(void)
>>  			node_property->perf_level = default_memtier_perf_level;
>>  	}
>>  	mutex_unlock(&memory_tier_lock);
>> +
>> +	migrate_on_reclaim_init();
>>  	return 0;
>>  }
>>  subsys_initcall(memory_tier_init);
> 
> Best Regards,
> Huang, Ying


Will update the patch in the next iteration to take care of the other feedback.

Thanks
-aneesh



* Re: [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-26  7:44   ` Huang, Ying
@ 2022-07-26 12:30     ` Aneesh Kumar K V
  2022-07-27  1:40       ` Huang, Ying
  0 siblings, 1 reply; 39+ messages in thread
From: Aneesh Kumar K V @ 2022-07-26 12:30 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

On 7/26/22 1:14 PM, Huang, Ying wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> 
>> This patch switch the demotion target building logic to use memory tiers
>> instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
>> default memory tier and additional memory tiers will be added by drivers like
>> dax kmem.
>>
>> This patch builds the demotion target for a NUMA node by looking at all
>> memory tiers below the tier to which the NUMA node belongs. The closest node
>> in the immediately following memory tier is used as a demotion target.
>>
>> Since we are now only building demotion target for N_MEMORY NUMA nodes
>> the CPU hotplug calls are removed in this patch.
>>
>> A new memory tier can be inserted into the tier hierarchy for a new set of nodes
>> without affecting the node assignment of any existing memtier, provided that
>> there is enough gap in the performance level values for the new memtier.
>>
>> The absolute value of performance level of a memtier doesn't necessarily carry
>> any meaning. Its value relative to other memtiers decides the level of this
>> memtier in the tier hierarchy.
> 
> The above 2 paragraphs appear not related to the patch.
> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  include/linux/memory-tiers.h |  12 ++
>>  include/linux/migrate.h      |  13 --
>>  mm/memory-tiers.c            | 218 ++++++++++++++++++-
>>  mm/migrate.c                 | 394 -----------------------------------
>>  mm/vmstat.c                  |   4 -
>>  5 files changed, 229 insertions(+), 412 deletions(-)
>>
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> index 3d5f14d57ae6..852e86bd0a23 100644
>> --- a/include/linux/memory-tiers.h
>> +++ b/include/linux/memory-tiers.h
>> @@ -17,9 +17,21 @@
>>  #define MEMTIER_HOTPLUG_PRIO	100
>>  
>>  extern bool numa_demotion_enabled;
>> +#ifdef CONFIG_MIGRATION
>> +int next_demotion_node(int node);
>> +#else
>> +static inline int next_demotion_node(int node)
>> +{
>> +	return NUMA_NO_NODE;
>> +}
>> +#endif
>>  
>>  #else
>>  
>>  #define numa_demotion_enabled	false
>> +static inline int next_demotion_node(int node)
>> +{
>> +	return NUMA_NO_NODE;
>> +}
>>  #endif	/* CONFIG_NUMA */
>>  #endif  /* _LINUX_MEMORY_TIERS_H */
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index 43e737215f33..93fab62e6548 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
>>  
>>  #endif /* CONFIG_MIGRATION */
>>  
>> -#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
>> -extern void set_migration_target_nodes(void);
>> -extern void migrate_on_reclaim_init(void);
>> -extern int next_demotion_node(int node);
>> -#else
>> -static inline void set_migration_target_nodes(void) {}
>> -static inline void migrate_on_reclaim_init(void) {}
>> -static inline int next_demotion_node(int node)
>> -{
>> -        return NUMA_NO_NODE;
>> -}
>> -#endif
>> -
>>  #ifdef CONFIG_COMPACTION
>>  extern int PageMovable(struct page *page);
>>  extern void __SetPageMovable(struct page *page, struct address_space *mapping);
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> index cc3a47ec18e4..a8cfe2ca3903 100644
>> --- a/mm/memory-tiers.c
>> +++ b/mm/memory-tiers.c
>> @@ -6,17 +6,88 @@
>>  #include <linux/moduleparam.h>
>>  #include <linux/node.h>
>>  #include <linux/memory.h>
>> +#include <linux/random.h>
>>  #include <linux/memory-tiers.h>
>>  
>> +#include "internal.h"
>> +
>>  struct memory_tier {
>>  	struct list_head list;
>>  	int perf_level;
>>  	nodemask_t nodelist;
>>  };
>>  
>> +struct demotion_nodes {
>> +	nodemask_t preferred;
>> +};
>> +
>>  static LIST_HEAD(memory_tiers);
>>  static DEFINE_MUTEX(memory_tier_lock);
>>  
>> +#ifdef CONFIG_MIGRATION
>> +/*
>> + * node_demotion[] examples:
>> + *
>> + * Example 1:
>> + *
>> + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
>> + *
>> + * node distances:
>> + * node   0    1    2    3
>> + *    0  10   20   30   40
>> + *    1  20   10   40   30
>> + *    2  30   40   10   40
>> + *    3  40   30   40   10
>> + *
>> + * memory_tiers[0] = <empty>
>> + * memory_tiers[1] = 0-1
>> + * memory_tiers[2] = 2-3
> 
> We don't have memory_tiers array now, and we don't use static memory
> tier ID too.  So I suggest to change the above to,
> 
> memory_tier0: 0-1
> memory_tier1: 2-3
> 
>> + *
>> + * node_demotion[0].preferred = 2
>> + * node_demotion[1].preferred = 3
>> + * node_demotion[2].preferred = <empty>
>> + * node_demotion[3].preferred = <empty>
>> + *
>> + * Example 2:
>> + *
>> + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
>> + *
>> + * node distances:
>> + * node   0    1    2
>> + *    0  10   20   30
>> + *    1  20   10   30
>> + *    2  30   30   10
>> + *
>> + * memory_tiers[0] = <empty>
>> + * memory_tiers[1] = 0-2
>> + * memory_tiers[2] = <empty>
>> + *
>> + * node_demotion[0].preferred = <empty>
>> + * node_demotion[1].preferred = <empty>
>> + * node_demotion[2].preferred = <empty>
>> + *
>> + * Example 3:
>> + *
>> + * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
>> + *
>> + * node distances:
>> + * node   0    1    2
>> + *    0  10   20   30
>> + *    1  20   10   40
>> + *    2  30   40   10
>> + *
>> + * memory_tiers[0] = 1
>> + * memory_tiers[1] = 0
>> + * memory_tiers[2] = 2
>> + *
>> + * node_demotion[0].preferred = 2
>> + * node_demotion[1].preferred = 0
>> + * node_demotion[2].preferred = <empty>
>> + *
>> + */
>> +static struct demotion_nodes *node_demotion __read_mostly;
>> +#endif /* CONFIG_MIGRATION */
>> +
>>  /*
>>   * For now let's have 4 memory tier below default DRAM tier.
>>   */
>> @@ -76,6 +147,136 @@ static struct memory_tier *__node_get_memory_tier(int node)
>>  	return NULL;
>>  }
>>  
>> +#ifdef CONFIG_MIGRATION
>> +/**
>> + * next_demotion_node() - Get the next node in the demotion path
>> + * @node: The starting node to lookup the next node
>> + *
>> + * Return: node id for next memory node in the demotion path hierarchy
>> + * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
>> + * @node online or guarantee that it *continues* to be the next demotion
>> + * target.
>> + */
>> +int next_demotion_node(int node)
>> +{
>> +	struct demotion_nodes *nd;
>> +	int target;
>> +
>> +	if (!node_demotion)
>> +		return NUMA_NO_NODE;
>> +
>> +	nd = &node_demotion[node];
>> +
>> +	/*
>> +	 * node_demotion[] is updated without excluding this
>> +	 * function from running.
>> +	 *
>> +	 * Make sure to use RCU over entire code blocks if
>> +	 * node_demotion[] reads need to be consistent.
>> +	 */
>> +	rcu_read_lock();
>> +	/*
>> +	 * If there are multiple target nodes, just select one
>> +	 * target node randomly.
>> +	 *
>> +	 * In addition, we can also use round-robin to select
>> +	 * target node, but we should introduce another variable
>> +	 * for node_demotion[] to record last selected target node,
>> +	 * that may cause cache ping-pong due to the changing of
>> +	 * last target node. Or introducing per-cpu data to avoid
>> +	 * caching issue, which seems more complicated. So selecting
>> +	 * target node randomly seems better until now.
>> +	 */
>> +	target = node_random(&nd->preferred);
> 
> In one of the most common cases, nodes_weight(&nd->preferred) == 1.
> Where, get_random_int() in node_random() just wastes CPU cycles and
> random entropy.  So the original struct demotion_nodes implementation
> appears better.
> 
>   struct demotion_nodes {
>          unsigned short nr;
>          short nodes[DEMOTION_TARGET_NODES];
>   };
> 


Is that a measurable difference? Using nodemask_t makes the implementation much
easier. IMHO if we observe that node_random() has a performance impact when
nodes_weight() == 1, we should fix node_random() to handle that case. If you
strongly feel we should fix this, I can open-code node_random() to special-case
nodes_weight() == 1?

-	target = node_random(&nd->preferred);
+	node_weight = nodes_weight(nd->preferred);
+	switch (node_weight) {
+	case 0:
+		target = NUMA_NO_NODE;
+		break;
+	case 1:
+		target = first_node(nd->preferred);
+		break;
+	default:
+		target = bitmap_ord_to_pos(nd->preferred.bits,
+					   get_random_int() % node_weight, MAX_NUMNODES);
+		break;
+	}
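
The same idea as a standalone userspace sketch, for anyone who wants to
play with it outside the kernel.  mask_weight(), mask_ord_to_pos() and
pick_target() are hypothetical helpers that stand in for nodes_weight(),
bitmap_ord_to_pos() and the open-coded selection; rand() replaces
get_random_int().

```c
#include <stdint.h>
#include <stdlib.h>

#define NO_NODE (-1)

/* Number of set bits in @mask. */
static int mask_weight(uint64_t mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Position of the @ord-th set bit (0-based), or NO_NODE if there is none. */
static int mask_ord_to_pos(uint64_t mask, int ord)
{
	int pos;

	for (pos = 0; pos < 64; pos++) {
		if (!(mask & (1ULL << pos)))
			continue;
		if (ord-- == 0)
			return pos;
	}
	return NO_NODE;
}

/* Pick a demotion target, drawing a random number only when needed. */
static int pick_target(uint64_t preferred)
{
	int w = mask_weight(preferred);

	if (w == 0)
		return NO_NODE;
	if (w == 1)
		return mask_ord_to_pos(preferred, 0);	/* no randomness needed */
	return mask_ord_to_pos(preferred, rand() % w);
}
```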
 

>> +	rcu_read_unlock();
>> +
>> +	return target;
>> +}
>> +
>> +/* Disable reclaim-based migration. */
>> +static void __disable_all_migrate_targets(void)
> 
> How about rename "migrate" to "demote" to make it more specific?
> 
>> +{
>> +	int node;
>> +
>> +	for_each_node_state(node, N_MEMORY)
>> +		node_demotion[node].preferred = NODE_MASK_NONE;
>> +}
>> +
>> +static void disable_all_migrate_targets(void)
>> +{
>> +	__disable_all_migrate_targets();
>> +
>> +	/*
>> +	 * Ensure that the "disable" is visible across the system.
>> +	 * Readers will see either a combination of before+disable
>> +	 * state or disable+after.  They will never see before and
>> +	 * after state together.
>> +	 */
>> +	synchronize_rcu();
>> +}
>> +/*
>> + * Find an automatic demotion target for all memory
>> + * nodes. Failing here is OK.  It might just indicate
>> + * being at the end of a chain.
>> + */
>> +static void establish_migration_targets(void)
>> +{
>> +	struct memory_tier *memtier;
>> +	struct demotion_nodes *nd;
>> +	int target = NUMA_NO_NODE, node;
>> +	int distance, best_distance;
>> +	nodemask_t used;
>> +
>> +	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
>> +		return;
>> +
>> +	disable_all_migrate_targets();
>> +
>> +	for_each_node_state(node, N_MEMORY) {
>> +		best_distance = -1;
>> +		nd = &node_demotion[node];
>> +
>> +		memtier = __node_get_memory_tier(node);
>> +		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
>> +			continue;
>> +		/*
>> +		 * Get the next memtier to find the  demotion node list.
>> +		 */
>> +		memtier = list_next_entry(memtier, list);
>> +
>> +		/*
>> +		 * find_next_best_node, use 'used' nodemask as a skip list.
>> +		 * Add all memory nodes except the selected memory tier
>> +		 * nodelist to skip list so that we find the best node from the
>> +		 * memtier nodelist.
>> +		 */
>> +		nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);
>> +
>> +		/*
>> +		 * Find all the nodes in the memory tier node list of same best distance.
>> +		 * add them to the preferred mask. We randomly select between nodes
>> +		 * in the preferred mask when allocating pages during demotion.
>> +		 */
>> +		do {
>> +			target = find_next_best_node(node, &used);
>> +			if (target == NUMA_NO_NODE)
>> +				break;
>> +
>> +			distance = node_distance(node, target);
>> +			if (distance == best_distance || best_distance == -1) {
>> +				best_distance = distance;
>> +				node_set(target, nd->preferred);
>> +			} else {
>> +				break;
>> +			}
>> +		} while (1);
>> +	}
>> +}
>> +#else
>> +static inline void disable_all_migrate_targets(void) {}
>> +static inline void establish_migration_targets(void) {}
>> +#endif /* CONFIG_MIGRATION */
>> +
>>  static void init_node_memory_tier(int node)
>>  {
>>  	int perf_level;
>> @@ -84,11 +285,19 @@ static void init_node_memory_tier(int node)
>>  	mutex_lock(&memory_tier_lock);
>>  
>>  	memtier = __node_get_memory_tier(node);
>> +	/*
>> +	 * if node is already part of the tier proceed with the
>> +	 * current tier value, because we might want to establish
>> +	 * new migration paths now. The node might be added to a tier
>> +	 * before it was made part of N_MEMORY, hence estabilish_migration_targets
>> +	 * will have skipped this node.
>> +	 */
>>  	if (!memtier) {
>>  		perf_level = node_devices[node]->perf_level;
>>  		memtier = find_create_memory_tier(perf_level);
>>  		node_set(node, memtier->nodelist);
>>  	}
>> +	establish_migration_targets();
> 
> Why combines memory tiers establishing with demotion targets building?
> I think that it's better to separate them.   For example, if we move a
> set of NUMA node from one memory tier to another memory tier, we only
> need to run establish_migration_targets() once after moving all nodes.
> 

Yes, agreed, but I am not sure I followed your comment here.

Demotion target rebuilding is a separate helper. Any update to the memory tiers
requires rebuilding the demotion targets, and so does any change in the node
mask of a memory tier. Can you clarify the code change you are suggesting here?
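
One possible reading of the suggestion, written as a userspace sketch
and not as a proposed kernel change: move every node first and rebuild
the targets once at the end.  struct tier, move_node_norebuild(),
move_nodes() and the rebuilds counter are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct memory_tier. */
struct tier { uint64_t nodelist; };

static int rebuilds;	/* counts simulated demotion target rebuilds */

static void rebuild_demotion_targets(void)
{
	rebuilds++;
}

/* Move one node between tiers without touching the demotion targets. */
static void move_node_norebuild(struct tier *from, struct tier *to, int node)
{
	from->nodelist &= ~(1ULL << node);
	to->nodelist |= 1ULL << node;
}

/* Move every node in @nodes, then rebuild the targets a single time. */
static void move_nodes(struct tier *from, struct tier *to, uint64_t nodes)
{
	int node;

	for (node = 0; node < 64; node++) {
		if (nodes & (1ULL << node))
			move_node_norebuild(from, to, node);
	}
	rebuild_demotion_targets();
}
```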



>>  	mutex_unlock(&memory_tier_lock);
>>  }
>>  
>> @@ -98,8 +307,10 @@ static void clear_node_memory_tier(int node)
>>  
>>  	mutex_lock(&memory_tier_lock);
>>  	memtier = __node_get_memory_tier(node);
>> -	if (memtier)
>> +	if (memtier) {
>>  		node_clear(node, memtier->nodelist);
>> +		establish_migration_targets();
>> +	}
>>  	mutex_unlock(&memory_tier_lock);
>>  }
>>  
>> @@ -134,6 +345,11 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>>  
>>  static void __init migrate_on_reclaim_init(void)
>>  {
>> +	if (IS_ENABLED(CONFIG_MIGRATION)) {
>> +		node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
> 
> Why allocate MAX_NUMNODES instead of nr_node_ids as before?
> 
>> +					GFP_KERNEL);
>> +		WARN_ON(!node_demotion);
>> +	}
>>  	hotplug_memory_notifier(migrate_on_reclaim_callback, MEMTIER_HOTPLUG_PRIO);
>>  }
>>  
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index fce7d4a9e940..c758c9c21d7d 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -2117,398 +2117,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>>  	return 0;
>>  }
>>  #endif /* CONFIG_NUMA_BALANCING */
>> -
>> -/*
>> - * node_demotion[] example:
>> - *
>> - * Consider a system with two sockets.  Each socket has
>> - * three classes of memory attached: fast, medium and slow.
>> - * Each memory class is placed in its own NUMA node.  The
>> - * CPUs are placed in the node with the "fast" memory.  The
>> - * 6 NUMA nodes (0-5) might be split among the sockets like
>> - * this:
>> - *
>> - *	Socket A: 0, 1, 2
>> - *	Socket B: 3, 4, 5
>> - *
>> - * When Node 0 fills up, its memory should be migrated to
>> - * Node 1.  When Node 1 fills up, it should be migrated to
>> - * Node 2.  The migration path start on the nodes with the
>> - * processors (since allocations default to this node) and
>> - * fast memory, progress through medium and end with the
>> - * slow memory:
>> - *
>> - *	0 -> 1 -> 2 -> stop
>> - *	3 -> 4 -> 5 -> stop
>> - *
>> - * This is represented in the node_demotion[] like this:
>> - *
>> - *	{  nr=1, nodes[0]=1 }, // Node 0 migrates to 1
>> - *	{  nr=1, nodes[0]=2 }, // Node 1 migrates to 2
>> - *	{  nr=0, nodes[0]=-1 }, // Node 2 does not migrate
>> - *	{  nr=1, nodes[0]=4 }, // Node 3 migrates to 4
>> - *	{  nr=1, nodes[0]=5 }, // Node 4 migrates to 5
>> - *	{  nr=0, nodes[0]=-1 }, // Node 5 does not migrate
>> - *
>> - * Moreover some systems may have multiple slow memory nodes.
>> - * Suppose a system has one socket with 3 memory nodes, node 0
>> - * is fast memory type, and node 1/2 both are slow memory
>> - * type, and the distance between fast memory node and slow
>> - * memory node is same. So the migration path should be:
>> - *
>> - *	0 -> 1/2 -> stop
>> - *
>> - * This is represented in the node_demotion[] like this:
>> - *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
>> - *	{ nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
>> - *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
>> - */
>> -
>> -/*
>> - * Writes to this array occur without locking.  Cycles are
>> - * not allowed: Node X demotes to Y which demotes to X...
>> - *
>> - * If multiple reads are performed, a single rcu_read_lock()
>> - * must be held over all reads to ensure that no cycles are
>> - * observed.
>> - */
>> -#define DEFAULT_DEMOTION_TARGET_NODES 15
>> -
>> -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
>> -#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
>> -#else
>> -#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
>> -#endif
>> -
>> -struct demotion_nodes {
>> -	unsigned short nr;
>> -	short nodes[DEMOTION_TARGET_NODES];
>> -};
>> -
>> -static struct demotion_nodes *node_demotion __read_mostly;
>> -
>> -/**
>> - * next_demotion_node() - Get the next node in the demotion path
>> - * @node: The starting node to lookup the next node
>> - *
>> - * Return: node id for next memory node in the demotion path hierarchy
>> - * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
>> - * @node online or guarantee that it *continues* to be the next demotion
>> - * target.
>> - */
>> -int next_demotion_node(int node)
>> -{
>> -	struct demotion_nodes *nd;
>> -	unsigned short target_nr, index;
>> -	int target;
>> -
>> -	if (!node_demotion)
>> -		return NUMA_NO_NODE;
>> -
>> -	nd = &node_demotion[node];
>> -
>> -	/*
>> -	 * node_demotion[] is updated without excluding this
>> -	 * function from running.  RCU doesn't provide any
>> -	 * compiler barriers, so the READ_ONCE() is required
>> -	 * to avoid compiler reordering or read merging.
>> -	 *
>> -	 * Make sure to use RCU over entire code blocks if
>> -	 * node_demotion[] reads need to be consistent.
>> -	 */
>> -	rcu_read_lock();
>> -	target_nr = READ_ONCE(nd->nr);
>> -
>> -	switch (target_nr) {
>> -	case 0:
>> -		target = NUMA_NO_NODE;
>> -		goto out;
>> -	case 1:
>> -		index = 0;
>> -		break;
>> -	default:
>> -		/*
>> -		 * If there are multiple target nodes, just select one
>> -		 * target node randomly.
>> -		 *
>> -		 * In addition, we can also use round-robin to select
>> -		 * target node, but we should introduce another variable
>> -		 * for node_demotion[] to record last selected target node,
>> -		 * that may cause cache ping-pong due to the changing of
>> -		 * last target node. Or introducing per-cpu data to avoid
>> -		 * caching issue, which seems more complicated. So selecting
>> -		 * target node randomly seems better until now.
>> -		 */
>> -		index = get_random_int() % target_nr;
>> -		break;
>> -	}
>> -
>> -	target = READ_ONCE(nd->nodes[index]);
>> -
>> -out:
>> -	rcu_read_unlock();
>> -	return target;
>> -}
>> -
>> -/* Disable reclaim-based migration. */
>> -static void __disable_all_migrate_targets(void)
>> -{
>> -	int node, i;
>> -
>> -	if (!node_demotion)
>> -		return;
>> -
>> -	for_each_online_node(node) {
>> -		node_demotion[node].nr = 0;
>> -		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
>> -			node_demotion[node].nodes[i] = NUMA_NO_NODE;
>> -	}
>> -}
>> -
>> -static void disable_all_migrate_targets(void)
>> -{
>> -	__disable_all_migrate_targets();
>> -
>> -	/*
>> -	 * Ensure that the "disable" is visible across the system.
>> -	 * Readers will see either a combination of before+disable
>> -	 * state or disable+after.  They will never see before and
>> -	 * after state together.
>> -	 *
>> -	 * The before+after state together might have cycles and
>> -	 * could cause readers to do things like loop until this
>> -	 * function finishes.  This ensures they can only see a
>> -	 * single "bad" read and would, for instance, only loop
>> -	 * once.
>> -	 */
>> -	synchronize_rcu();
>> -}
>> -
>> -/*
>> - * Find an automatic demotion target for 'node'.
>> - * Failing here is OK.  It might just indicate
>> - * being at the end of a chain.
>> - */
>> -static int establish_migrate_target(int node, nodemask_t *used,
>> -				    int best_distance)
>> -{
>> -	int migration_target, index, val;
>> -	struct demotion_nodes *nd;
>> -
>> -	if (!node_demotion)
>> -		return NUMA_NO_NODE;
>> -
>> -	nd = &node_demotion[node];
>> -
>> -	migration_target = find_next_best_node(node, used);
>> -	if (migration_target == NUMA_NO_NODE)
>> -		return NUMA_NO_NODE;
>> -
>> -	/*
>> -	 * If the node has been set a migration target node before,
>> -	 * which means it's the best distance between them. Still
>> -	 * check if this node can be demoted to other target nodes
>> -	 * if they have a same best distance.
>> -	 */
>> -	if (best_distance != -1) {
>> -		val = node_distance(node, migration_target);
>> -		if (val > best_distance)
>> -			goto out_clear;
>> -	}
>> -
>> -	index = nd->nr;
>> -	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
>> -		      "Exceeds maximum demotion target nodes\n"))
>> -		goto out_clear;
>> -
>> -	nd->nodes[index] = migration_target;
>> -	nd->nr++;
>> -
>> -	return migration_target;
>> -out_clear:
>> -	node_clear(migration_target, *used);
>> -	return NUMA_NO_NODE;
>> -}
>> -
>> -/*
>> - * When memory fills up on a node, memory contents can be
>> - * automatically migrated to another node instead of
>> - * discarded at reclaim.
>> - *
>> - * Establish a "migration path" which will start at nodes
>> - * with CPUs and will follow the priorities used to build the
>> - * page allocator zonelists.
>> - *
>> - * The difference here is that cycles must be avoided.  If
>> - * node0 migrates to node1, then neither node1, nor anything
>> - * node1 migrates to can migrate to node0. Also one node can
>> - * be migrated to multiple nodes if the target nodes all have
>> - * a same best-distance against the source node.
>> - *
>> - * This function can run simultaneously with readers of
>> - * node_demotion[].  However, it can not run simultaneously
>> - * with itself.  Exclusion is provided by memory hotplug events
>> - * being single-threaded.
>> - */
>> -static void __set_migration_target_nodes(void)
>> -{
>> -	nodemask_t next_pass;
>> -	nodemask_t this_pass;
>> -	nodemask_t used_targets = NODE_MASK_NONE;
>> -	int node, best_distance;
>> -
>> -	/*
>> -	 * Avoid any oddities like cycles that could occur
>> -	 * from changes in the topology.  This will leave
>> -	 * a momentary gap when migration is disabled.
>> -	 */
>> -	disable_all_migrate_targets();
>> -
>> -	/*
>> -	 * Allocations go close to CPUs, first.  Assume that
>> -	 * the migration path starts at the nodes with CPUs.
>> -	 */
>> -	next_pass = node_states[N_CPU];
>> -again:
>> -	this_pass = next_pass;
>> -	next_pass = NODE_MASK_NONE;
>> -	/*
>> -	 * To avoid cycles in the migration "graph", ensure
>> -	 * that migration sources are not future targets by
>> -	 * setting them in 'used_targets'.  Do this only
>> -	 * once per pass so that multiple source nodes can
>> -	 * share a target node.
>> -	 *
>> -	 * 'used_targets' will become unavailable in future
>> -	 * passes.  This limits some opportunities for
>> -	 * multiple source nodes to share a destination.
>> -	 */
>> -	nodes_or(used_targets, used_targets, this_pass);
>> -
>> -	for_each_node_mask(node, this_pass) {
>> -		best_distance = -1;
>> -
>> -		/*
>> -		 * Try to set up the migration path for the node, and the target
>> -		 * migration nodes can be multiple, so doing a loop to find all
>> -		 * the target nodes if they all have a best node distance.
>> -		 */
>> -		do {
>> -			int target_node =
>> -				establish_migrate_target(node, &used_targets,
>> -							 best_distance);
>> -
>> -			if (target_node == NUMA_NO_NODE)
>> -				break;
>> -
>> -			if (best_distance == -1)
>> -				best_distance = node_distance(node, target_node);
>> -
>> -			/*
>> -			 * Visit targets from this pass in the next pass.
>> -			 * Eventually, every node will have been part of
>> -			 * a pass, and will become set in 'used_targets'.
>> -			 */
>> -			node_set(target_node, next_pass);
>> -		} while (1);
>> -	}
>> -	/*
>> -	 * 'next_pass' contains nodes which became migration
>> -	 * targets in this pass.  Make additional passes until
>> -	 * no more migrations targets are available.
>> -	 */
>> -	if (!nodes_empty(next_pass))
>> -		goto again;
>> -}
>> -
>> -/*
>> - * For callers that do not hold get_online_mems() already.
>> - */
>> -void set_migration_target_nodes(void)
>> -{
>> -	get_online_mems();
>> -	__set_migration_target_nodes();
>> -	put_online_mems();
>> -}
>> -
>> -/*
>> - * This leaves migrate-on-reclaim transiently disabled between
>> - * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
>> - * whether reclaim-based migration is enabled or not, which
>> - * ensures that the user can turn reclaim-based migration at
>> - * any time without needing to recalculate migration targets.
>> - *
>> - * These callbacks already hold get_online_mems().  That is why
>> - * __set_migration_target_nodes() can be used as opposed to
>> - * set_migration_target_nodes().
>> - */
>> -#ifdef CONFIG_MEMORY_HOTPLUG
>> -static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>> -						 unsigned long action, void *_arg)
>> -{
>> -	struct memory_notify *arg = _arg;
>> -
>> -	/*
>> -	 * Only update the node migration order when a node is
>> -	 * changing status, like online->offline.  This avoids
>> -	 * the overhead of synchronize_rcu() in most cases.
>> -	 */
>> -	if (arg->status_change_nid < 0)
>> -		return notifier_from_errno(0);
>> -
>> -	switch (action) {
>> -	case MEM_GOING_OFFLINE:
>> -		/*
>> -		 * Make sure there are not transient states where
>> -		 * an offline node is a migration target.  This
>> -		 * will leave migration disabled until the offline
>> -		 * completes and the MEM_OFFLINE case below runs.
>> -		 */
>> -		disable_all_migrate_targets();
>> -		break;
>> -	case MEM_OFFLINE:
>> -	case MEM_ONLINE:
>> -		/*
>> -		 * Recalculate the target nodes once the node
>> -		 * reaches its final state (online or offline).
>> -		 */
>> -		__set_migration_target_nodes();
>> -		break;
>> -	case MEM_CANCEL_OFFLINE:
>> -		/*
>> -		 * MEM_GOING_OFFLINE disabled all the migration
>> -		 * targets.  Reenable them.
>> -		 */
>> -		__set_migration_target_nodes();
>> -		break;
>> -	case MEM_GOING_ONLINE:
>> -	case MEM_CANCEL_ONLINE:
>> -		break;
>> -	}
>> -
>> -	return notifier_from_errno(0);
>> -}
>> -#endif
>> -
>> -void __init migrate_on_reclaim_init(void)
>> -{
>> -	node_demotion = kcalloc(nr_node_ids,
>> -				sizeof(struct demotion_nodes),
>> -				GFP_KERNEL);
>> -	WARN_ON(!node_demotion);
>> -#ifdef CONFIG_MEMORY_HOTPLUG
>> -	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
>> -#endif
>> -	/*
>> -	 * At this point, all numa nodes with memory/CPus have their state
>> -	 * properly set, so we can build the demotion order now.
>> -	 * Let us hold the cpu_hotplug lock just, as we could possibily have
>> -	 * CPU hotplug events during boot.
>> -	 */
>> -	cpus_read_lock();
>> -	set_migration_target_nodes();
>> -	cpus_read_unlock();
>> -}
>>  #endif /* CONFIG_NUMA */
>> -
>> -
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 373d2730fcf2..35c6ff97cf29 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -28,7 +28,6 @@
>>  #include <linux/mm_inline.h>
>>  #include <linux/page_ext.h>
>>  #include <linux/page_owner.h>
>> -#include <linux/migrate.h>
>>  
>>  #include "internal.h"
>>  
>> @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu)
>>  
>>  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
>>  		node_set_state(cpu_to_node(cpu), N_CPU);
>> -		set_migration_target_nodes();
>>  	}
> 
> "{" and "}" can be removed too?
> 

I will update the patch to incorporate the other changes you suggested.

-aneesh

* Re: [PATCH v10 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-26 11:59     ` Aneesh Kumar K V
@ 2022-07-27  1:16       ` Huang, Ying
  2022-07-28 17:23         ` Johannes Weiner
  0 siblings, 1 reply; 39+ messages in thread
From: Huang, Ying @ 2022-07-27  1:16 UTC (permalink / raw)
  To: Aneesh Kumar K V, Wei Xu, Johannes Weiner
  Cc: linux-mm, akpm, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	jvgediya.oss, Jagdish Gediya

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

>>> diff --git a/include/linux/node.h b/include/linux/node.h
>>> index 40d641a8bfb0..a2a16d4104fd 100644
>>> --- a/include/linux/node.h
>>> +++ b/include/linux/node.h
>>> @@ -92,6 +92,12 @@ struct node {
>>>  	struct list_head cache_attrs;
>>>  	struct device *cache_dev;
>>>  #endif
>>> +	/*
>>> +	 * For memory devices, perf_level describes
>>> +	 * the device performance and how it should be used
>>> +	 * while building a memory hierarchy.
>>> +	 */
>>> +	int perf_level;
>> 
>> Think again, I found that "perf_level" may be not the best abstraction
>> of the performance of memory devices.  In concept, it's an abstraction of the memory
>> bandwidth.  But it will not reflect the memory latency.
>> 
>> Instead, the previous proposed "abstract_distance" is an abstraction of
>> the memory latency.  Per my understanding, the memory latency has more
>> direct influence on system performance.  And because the latency of the
>> memory device will increase if the memory accessing throughput nears its
>> max bandwidth, so the memory bandwidth can be reflected in the "abstract
>> distance" too.  That is, the "abstract distance" is an abstraction of
>> the memory latency under the expected memory accessing throughput.  The
>> "offset" to the default "abstract distance" reflects the different
>> expected memory accessing throughput.
>> 
>> So, I think we need some kind of abstraction of the memory latency
>> instead of memory bandwidth, e.g., "abstract distance".
>> 
>
> I am reworking other parts of the patch set based on your feedback.

Thanks!

> This part I guess we need to reach some consensus.

Yes.  Let's do that.

> IMHO perf_level (performance level) can indicate a combination of both latency
> and bandwidth.

"abstract distance" is based on latency, and bandwidth is reflected via
"latency under the expected memory accessing throughput".

How does perf_level indicate the combination?  Per my understanding,
it's bandwidth based.

> It is an abstract concept that indicates the performance of the
> device. As we learn more about which device attribute makes more impact in
> defining hierarchy, performance level will give more weightage to that specific
> attribute. It could be write latency or bandwidth. For me, distance has a direct
> linkage to latency because that is how we define numa distance now. Adding
> abstract to the name is not making it more abstract than perf_level. 
>
> I am open to suggestions from others.  Wei Xu has also suggested perf_level name.
> I can rename this to abstract_distance if that indicates the goal better.

I'm open on the naming.  But I think it's good to define it to some
degree rather than leave it completely opaque.  If it's latency based,
then a low value corresponds to high performance.  If it's bandwidth
based, then a low value corresponds to low performance.
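
To make the point about direction concrete, here is a minimal sketch.
It is not code from the patch set: enum metric_kind and
value_is_faster() are hypothetical names, and the numbers a caller
passes are arbitrary.  It only shows that the same pair of values
orders two devices in opposite ways depending on which definition is
chosen.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: the two candidate meanings of the per-node value. */
enum metric_kind { METRIC_LATENCY, METRIC_BANDWIDTH };

/*
 * Return true if a device with value 'a' should be treated as faster
 * than a device with value 'b' under the given definition.
 */
bool value_is_faster(enum metric_kind kind, int a, int b)
{
	assert(a >= 0 && b >= 0);

	if (kind == METRIC_LATENCY)
		return a < b;	/* lower latency is better */
	return a > b;		/* higher bandwidth is better */
}
```

With a = 100 and b = 300, the latency reading puts the first device in
the higher tier and the bandwidth reading puts it in the lower one,
which is why an entirely opaque value is hard to review.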

Hi, Wei and Johannes,

What do you think about this?

Best Regards,
Huang, Ying

* Re: [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-26 12:30     ` Aneesh Kumar K V
@ 2022-07-27  1:40       ` Huang, Ying
  2022-07-27  4:35         ` Aneesh Kumar K.V
  2022-08-03  3:18         ` Aneesh Kumar K.V
  0 siblings, 2 replies; 39+ messages in thread
From: Huang, Ying @ 2022-07-27  1:40 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 7/26/22 1:14 PM, Huang, Ying wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> This patch switch the demotion target building logic to use memory tiers
>>> instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
>>> default memory tier and additional memory tiers will be added by drivers like
>>> dax kmem.
>>>
>>> This patch builds the demotion target for a NUMA node by looking at all
>>> memory tiers below the tier to which the NUMA node belongs. The closest node
>>> in the immediately following memory tier is used as a demotion target.
>>>
>>> Since we are now only building demotion target for N_MEMORY NUMA nodes
>>> the CPU hotplug calls are removed in this patch.
>>>
>>> A new memory tier can be inserted into the tier hierarchy for a new set of nodes
>>> without affecting the node assignment of any existing memtier, provided that
>>> there is enough gap in the performance level values for the new memtier.
>>>
>>> The absolute value of performance level of a memtier doesn't necessarily carry
>>> any meaning. Its value relative to other memtiers decides the level of this
>>> memtier in the tier hierarchy.
>> 
>> The above 2 paragraphs appear not related to the patch.
>> 
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>>  include/linux/memory-tiers.h |  12 ++
>>>  include/linux/migrate.h      |  13 --
>>>  mm/memory-tiers.c            | 218 ++++++++++++++++++-
>>>  mm/migrate.c                 | 394 -----------------------------------
>>>  mm/vmstat.c                  |   4 -
>>>  5 files changed, 229 insertions(+), 412 deletions(-)
>>>
>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>> index 3d5f14d57ae6..852e86bd0a23 100644
>>> --- a/include/linux/memory-tiers.h
>>> +++ b/include/linux/memory-tiers.h
>>> @@ -17,9 +17,21 @@
>>>  #define MEMTIER_HOTPLUG_PRIO	100
>>>  
>>>  extern bool numa_demotion_enabled;
>>> +#ifdef CONFIG_MIGRATION
>>> +int next_demotion_node(int node);
>>> +#else
>>> +static inline int next_demotion_node(int node)
>>> +{
>>> +	return NUMA_NO_NODE;
>>> +}
>>> +#endif
>>>  
>>>  #else
>>>  
>>>  #define numa_demotion_enabled	false
>>> +static inline int next_demotion_node(int node)
>>> +{
>>> +	return NUMA_NO_NODE;
>>> +}
>>>  #endif	/* CONFIG_NUMA */
>>>  #endif  /* _LINUX_MEMORY_TIERS_H */
>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>> index 43e737215f33..93fab62e6548 100644
>>> --- a/include/linux/migrate.h
>>> +++ b/include/linux/migrate.h
>>> @@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
>>>  
>>>  #endif /* CONFIG_MIGRATION */
>>>  
>>> -#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
>>> -extern void set_migration_target_nodes(void);
>>> -extern void migrate_on_reclaim_init(void);
>>> -extern int next_demotion_node(int node);
>>> -#else
>>> -static inline void set_migration_target_nodes(void) {}
>>> -static inline void migrate_on_reclaim_init(void) {}
>>> -static inline int next_demotion_node(int node)
>>> -{
>>> -        return NUMA_NO_NODE;
>>> -}
>>> -#endif
>>> -
>>>  #ifdef CONFIG_COMPACTION
>>>  extern int PageMovable(struct page *page);
>>>  extern void __SetPageMovable(struct page *page, struct address_space *mapping);
>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>> index cc3a47ec18e4..a8cfe2ca3903 100644
>>> --- a/mm/memory-tiers.c
>>> +++ b/mm/memory-tiers.c
>>> @@ -6,17 +6,88 @@
>>>  #include <linux/moduleparam.h>
>>>  #include <linux/node.h>
>>>  #include <linux/memory.h>
>>> +#include <linux/random.h>
>>>  #include <linux/memory-tiers.h>
>>>  
>>> +#include "internal.h"
>>> +
>>>  struct memory_tier {
>>>  	struct list_head list;
>>>  	int perf_level;
>>>  	nodemask_t nodelist;
>>>  };
>>>  
>>> +struct demotion_nodes {
>>> +	nodemask_t preferred;
>>> +};
>>> +
>>>  static LIST_HEAD(memory_tiers);
>>>  static DEFINE_MUTEX(memory_tier_lock);
>>>  
>>> +#ifdef CONFIG_MIGRATION
>>> +/*
>>> + * node_demotion[] examples:
>>> + *
>>> + * Example 1:
>>> + *
>>> + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
>>> + *
>>> + * node distances:
>>> + * node   0    1    2    3
>>> + *    0  10   20   30   40
>>> + *    1  20   10   40   30
>>> + *    2  30   40   10   40
>>> + *    3  40   30   40   10
>>> + *
>>> + * memory_tiers[0] = <empty>
>>> + * memory_tiers[1] = 0-1
>>> + * memory_tiers[2] = 2-3
>> 
>> We don't have memory_tiers array now, and we don't use static memory
>> tier ID too.  So I suggest to change the above to,
>> 
>> memory_tier0: 0-1
>> memory_tier1: 2-3
>> 
>>> + *
>>> + * node_demotion[0].preferred = 2
>>> + * node_demotion[1].preferred = 3
>>> + * node_demotion[2].preferred = <empty>
>>> + * node_demotion[3].preferred = <empty>
>>> + *
>>> + * Example 2:
>>> + *
>>> + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
>>> + *
>>> + * node distances:
>>> + * node   0    1    2
>>> + *    0  10   20   30
>>> + *    1  20   10   30
>>> + *    2  30   30   10
>>> + *
>>> + * memory_tiers[0] = <empty>
>>> + * memory_tiers[1] = 0-2
>>> + * memory_tiers[2] = <empty>
>>> + *
>>> + * node_demotion[0].preferred = <empty>
>>> + * node_demotion[1].preferred = <empty>
>>> + * node_demotion[2].preferred = <empty>
>>> + *
>>> + * Example 3:
>>> + *
>>> + * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
>>> + *
>>> + * node distances:
>>> + * node   0    1    2
>>> + *    0  10   20   30
>>> + *    1  20   10   40
>>> + *    2  30   40   10
>>> + *
>>> + * memory_tiers[0] = 1
>>> + * memory_tiers[1] = 0
>>> + * memory_tiers[2] = 2
>>> + *
>>> + * node_demotion[0].preferred = 2
>>> + * node_demotion[1].preferred = 0
>>> + * node_demotion[2].preferred = <empty>
>>> + *
>>> + */
>>> +static struct demotion_nodes *node_demotion __read_mostly;
>>> +#endif /* CONFIG_MIGRATION */
>>> +
>>>  /*
>>>   * For now let's have 4 memory tier below default DRAM tier.
>>>   */
>>> @@ -76,6 +147,136 @@ static struct memory_tier *__node_get_memory_tier(int node)
>>>  	return NULL;
>>>  }
>>>  
>>> +#ifdef CONFIG_MIGRATION
>>> +/**
>>> + * next_demotion_node() - Get the next node in the demotion path
>>> + * @node: The starting node to lookup the next node
>>> + *
>>> + * Return: node id for next memory node in the demotion path hierarchy
>>> + * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
>>> + * @node online or guarantee that it *continues* to be the next demotion
>>> + * target.
>>> + */
>>> +int next_demotion_node(int node)
>>> +{
>>> +	struct demotion_nodes *nd;
>>> +	int target;
>>> +
>>> +	if (!node_demotion)
>>> +		return NUMA_NO_NODE;
>>> +
>>> +	nd = &node_demotion[node];
>>> +
>>> +	/*
>>> +	 * node_demotion[] is updated without excluding this
>>> +	 * function from running.
>>> +	 *
>>> +	 * Make sure to use RCU over entire code blocks if
>>> +	 * node_demotion[] reads need to be consistent.
>>> +	 */
>>> +	rcu_read_lock();
>>> +	/*
>>> +	 * If there are multiple target nodes, just select one
>>> +	 * target node randomly.
>>> +	 *
>>> +	 * In addition, we can also use round-robin to select
>>> +	 * target node, but we should introduce another variable
>>> +	 * for node_demotion[] to record last selected target node,
>>> +	 * that may cause cache ping-pong due to the changing of
>>> +	 * last target node. Or introducing per-cpu data to avoid
>>> +	 * caching issue, which seems more complicated. So selecting
>>> +	 * target node randomly seems better until now.
>>> +	 */
>>> +	target = node_random(&nd->preferred);
>> 
>> In one of the most common cases, nodes_weight(&nd->preferred) == 1.
>> Where, get_random_int() in node_random() just wastes CPU cycles and
>> random entropy.  So the original struct demotion_nodes implementation
>> appears better.
>> 
>>   struct demotion_nodes {
>>          unsigned short nr;
>>          short nodes[DEMOTION_TARGET_NODES];
>>   };
>> 
>
>
> Is that measurable difference? using nodemask_t makes it much easier with respect to
> implementation. IMHO if we observe the usage of node_random() to have performance impact
> with nodes_weight() == 1 we should fix node_random() to handle that? If you strongly
> feel we should fix this, i can opencode node_random to special case node_weight() == 1?

If there's not much difference, why not just use the existing code?
IMHO, it's your responsibility to show with numbers that your new
implementation is better, for example, fewer lines of code with the
same or better performance.

Another policy is just to use the existing code in the first version.
Then change it based on measurement.

In general, I care more about the most common cases, that is, 0 or 1
demotion target.
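
As a rough userspace illustration of the common-case concern (a sketch
only, not kernel code): mask_t is a hypothetical 64-bit stand-in for
nodemask_t, rand() stands in for get_random_int(), and rng_calls is a
counter added purely to show when the random source is touched.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NO_NODE (-1)

/* Hypothetical stand-in for nodemask_t: bit n set means node n is a target. */
typedef uint64_t mask_t;

/* Counts how often the random source is consulted. */
static int rng_calls;

/* Number of set bits in the mask. */
static int mask_weight(mask_t m)
{
	int w = 0;

	for (; m; m &= m - 1)
		w++;
	return w;
}

/* Bit position of the ord-th (0-based) set bit. */
static int mask_ord_to_pos(mask_t m, int ord)
{
	assert(ord >= 0 && ord < mask_weight(m));

	for (int pos = 0; pos < 64; pos++) {
		if (!(m & ((mask_t)1 << pos)))
			continue;
		if (ord-- == 0)
			return pos;
	}
	return NO_NODE;
}

/* Pick a demotion target; only the multi-target case uses randomness. */
int pick_demotion_target(mask_t preferred)
{
	int weight = mask_weight(preferred);

	switch (weight) {
	case 0:
		return NO_NODE;
	case 1:
		return mask_ord_to_pos(preferred, 0);
	default:
		rng_calls++;
		return mask_ord_to_pos(preferred, rand() % weight);
	}
}
```

With zero or one bit set, rng_calls stays at 0; it only increases once
two or more targets are present.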

> -	target = node_random(&nd->preferred);
> +	node_weight = nodes_weight(nd->preferred);
> +	switch (node_weight) {
> +	case 0:
> +		target = NUMA_NO_NODE;
> +		break;
> +	case 1:
> +		target = first_node(nd->preferred);
> +		break;
> +	default:
> +		target = bitmap_ord_to_pos(nd->preferred.bits,
> +					   get_random_int() % node_weight, MAX_NUMNODES);
> +		break;
> +	}
>  
>
>>> +	rcu_read_unlock();
>>> +
>>> +	return target;
>>> +}
>>> +
>>> +/* Disable reclaim-based migration. */
>>> +static void __disable_all_migrate_targets(void)
>> 
>> How about rename "migrate" to "demote" to make it more specific?
>> 
>>> +{
>>> +	int node;
>>> +
>>> +	for_each_node_state(node, N_MEMORY)
>>> +		node_demotion[node].preferred = NODE_MASK_NONE;
>>> +}
>>> +
>>> +static void disable_all_migrate_targets(void)
>>> +{
>>> +	__disable_all_migrate_targets();
>>> +
>>> +	/*
>>> +	 * Ensure that the "disable" is visible across the system.
>>> +	 * Readers will see either a combination of before+disable
>>> +	 * state or disable+after.  They will never see before and
>>> +	 * after state together.
>>> +	 */
>>> +	synchronize_rcu();
>>> +}
>>> +/*
>>> + * Find an automatic demotion target for all memory
>>> + * nodes. Failing here is OK.  It might just indicate
>>> + * being at the end of a chain.
>>> + */
>>> +static void establish_migration_targets(void)
>>> +{
>>> +	struct memory_tier *memtier;
>>> +	struct demotion_nodes *nd;
>>> +	int target = NUMA_NO_NODE, node;
>>> +	int distance, best_distance;
>>> +	nodemask_t used;
>>> +
>>> +	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
>>> +		return;
>>> +
>>> +	disable_all_migrate_targets();
>>> +
>>> +	for_each_node_state(node, N_MEMORY) {
>>> +		best_distance = -1;
>>> +		nd = &node_demotion[node];
>>> +
>>> +		memtier = __node_get_memory_tier(node);
>>> +		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
>>> +			continue;
>>> +		/*
>>> +		 * Get the next memtier to find the  demotion node list.
>>> +		 */
>>> +		memtier = list_next_entry(memtier, list);
>>> +
>>> +		/*
>>> +		 * find_next_best_node, use 'used' nodemask as a skip list.
>>> +		 * Add all memory nodes except the selected memory tier
>>> +		 * nodelist to skip list so that we find the best node from the
>>> +		 * memtier nodelist.
>>> +		 */
>>> +		nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);
>>> +
>>> +		/*
>>> +		 * Find all the nodes in the memory tier node list of same best distance.
>>> +		 * add them to the preferred mask. We randomly select between nodes
>>> +		 * in the preferred mask when allocating pages during demotion.
>>> +		 */
>>> +		do {
>>> +			target = find_next_best_node(node, &used);
>>> +			if (target == NUMA_NO_NODE)
>>> +				break;
>>> +
>>> +			distance = node_distance(node, target);
>>> +			if (distance == best_distance || best_distance == -1) {
>>> +				best_distance = distance;
>>> +				node_set(target, nd->preferred);
>>> +			} else {
>>> +				break;
>>> +			}
>>> +		} while (1);
>>> +	}
>>> +}
>>> +#else
>>> +static inline void disable_all_migrate_targets(void) {}
>>> +static inline void establish_migration_targets(void) {}
>>> +#endif /* CONFIG_MIGRATION */
>>> +
>>>  static void init_node_memory_tier(int node)
>>>  {
>>>  	int perf_level;
>>> @@ -84,11 +285,19 @@ static void init_node_memory_tier(int node)
>>>  	mutex_lock(&memory_tier_lock);
>>>  
>>>  	memtier = __node_get_memory_tier(node);
>>> +	/*
>>> +	 * if node is already part of the tier proceed with the
>>> +	 * current tier value, because we might want to establish
>>> +	 * new migration paths now. The node might be added to a tier
>>> +	 * before it was made part of N_MEMORY, hence estabilish_migration_targets
>>> +	 * will have skipped this node.
>>> +	 */
>>>  	if (!memtier) {
>>>  		perf_level = node_devices[node]->perf_level;
>>>  		memtier = find_create_memory_tier(perf_level);
>>>  		node_set(node, memtier->nodelist);
>>>  	}
>>> +	establish_migration_targets();
>> 
>> Why combines memory tiers establishing with demotion targets building?
>> I think that it's better to separate them.   For example, if we move a
>> set of NUMA node from one memory tier to another memory tier, we only
>> need to run establish_migration_targets() once after moving all nodes.
>> 
>
> Yes, agreed. That said, I am not sure I fully followed your comment here.
>
> Demotion target rebuilding is a separate helper. Any update to the memory
> tiers requires rebuilding the demotion targets, as does any change in a
> memory tier's node mask. Can you clarify the code change you are suggesting here?

I think we should call establish_migration_targets() in
migrate_on_reclaim_callback() directly.  As in the example I mentioned
above, sometimes we don't need to call establish_migration_targets()
for each node change.
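
To make the batching point concrete, here is a minimal userspace sketch,
not kernel code.  All names in it (set_node_tier(), move_nodes_to_tier(),
node_online(), rebuild_demotion_targets(), rebuild_count) are made up for
illustration; rebuild_demotion_targets() only stands in for
establish_migration_targets() and just counts how often it runs.

```c
#include <assert.h>

#define NR_NODES 8	/* hypothetical; stands in for nr_node_ids */
#define NR_TIERS 4	/* hypothetical number of tiers */

static int node_tier[NR_NODES];	/* tier index assigned to each node */
static int rebuild_count;	/* how many times targets were rebuilt */

/* Stand-in for establish_migration_targets(); it only counts calls. */
static void rebuild_demotion_targets(void)
{
	rebuild_count++;
}

/* Pure tier bookkeeping: deliberately does NOT rebuild the targets. */
static void set_node_tier(int node, int tier)
{
	assert(node >= 0 && node < NR_NODES);
	assert(tier >= 0 && tier < NR_TIERS);
	node_tier[node] = tier;
}

/* Batch caller: move every node first, then rebuild exactly once. */
static void move_nodes_to_tier(const int *nodes, int nr, int tier)
{
	for (int i = 0; i < nr; i++)
		set_node_tier(nodes[i], tier);
	rebuild_demotion_targets();
}

/* Hotplug-style caller: one node changes, so one rebuild follows. */
static void node_online(int node, int tier)
{
	set_node_tier(node, tier);
	rebuild_demotion_targets();
}
```

With this split, moving three nodes costs one rebuild instead of three,
and the hotplug path still rebuilds after each status change.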

>>>  	mutex_unlock(&memory_tier_lock);
>>>  }
>>>  
>>> @@ -98,8 +307,10 @@ static void clear_node_memory_tier(int node)
>>>  
>>>  	mutex_lock(&memory_tier_lock);
>>>  	memtier = __node_get_memory_tier(node);
>>> -	if (memtier)
>>> +	if (memtier) {
>>>  		node_clear(node, memtier->nodelist);
>>> +		establish_migration_targets();
>>> +	}
>>>  	mutex_unlock(&memory_tier_lock);
>>>  }
>>>  
>>> @@ -134,6 +345,11 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>>>  
>>>  static void __init migrate_on_reclaim_init(void)
>>>  {
>>> +	if (IS_ENABLED(CONFIG_MIGRATION)) {
>>> +		node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
>> 
>> Why allocate MAX_NUMNODES instead of nr_node_ids as before?
>> 
>>> +					GFP_KERNEL);
>>> +		WARN_ON(!node_demotion);
>>> +	}
>>>  	hotplug_memory_notifier(migrate_on_reclaim_callback, MEMTIER_HOTPLUG_PRIO);
>>>  }
>>>  
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index fce7d4a9e940..c758c9c21d7d 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -2117,398 +2117,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>>>  	return 0;
>>>  }
>>>  #endif /* CONFIG_NUMA_BALANCING */
>>> -
>>> -/*
>>> - * node_demotion[] example:
>>> - *
>>> - * Consider a system with two sockets.  Each socket has
>>> - * three classes of memory attached: fast, medium and slow.
>>> - * Each memory class is placed in its own NUMA node.  The
>>> - * CPUs are placed in the node with the "fast" memory.  The
>>> - * 6 NUMA nodes (0-5) might be split among the sockets like
>>> - * this:
>>> - *
>>> - *	Socket A: 0, 1, 2
>>> - *	Socket B: 3, 4, 5
>>> - *
>>> - * When Node 0 fills up, its memory should be migrated to
>>> - * Node 1.  When Node 1 fills up, it should be migrated to
>>> - * Node 2.  The migration path start on the nodes with the
>>> - * processors (since allocations default to this node) and
>>> - * fast memory, progress through medium and end with the
>>> - * slow memory:
>>> - *
>>> - *	0 -> 1 -> 2 -> stop
>>> - *	3 -> 4 -> 5 -> stop
>>> - *
>>> - * This is represented in the node_demotion[] like this:
>>> - *
>>> - *	{  nr=1, nodes[0]=1 }, // Node 0 migrates to 1
>>> - *	{  nr=1, nodes[0]=2 }, // Node 1 migrates to 2
>>> - *	{  nr=0, nodes[0]=-1 }, // Node 2 does not migrate
>>> - *	{  nr=1, nodes[0]=4 }, // Node 3 migrates to 4
>>> - *	{  nr=1, nodes[0]=5 }, // Node 4 migrates to 5
>>> - *	{  nr=0, nodes[0]=-1 }, // Node 5 does not migrate
>>> - *
>>> - * Moreover some systems may have multiple slow memory nodes.
>>> - * Suppose a system has one socket with 3 memory nodes, node 0
>>> - * is fast memory type, and node 1/2 both are slow memory
>>> - * type, and the distance between fast memory node and slow
>>> - * memory node is same. So the migration path should be:
>>> - *
>>> - *	0 -> 1/2 -> stop
>>> - *
>>> - * This is represented in the node_demotion[] like this:
>>> - *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
>>> - *	{ nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
>>> - *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
>>> - */
>>> -
>>> -/*
>>> - * Writes to this array occur without locking.  Cycles are
>>> - * not allowed: Node X demotes to Y which demotes to X...
>>> - *
>>> - * If multiple reads are performed, a single rcu_read_lock()
>>> - * must be held over all reads to ensure that no cycles are
>>> - * observed.
>>> - */
>>> -#define DEFAULT_DEMOTION_TARGET_NODES 15
>>> -
>>> -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
>>> -#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
>>> -#else
>>> -#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
>>> -#endif
>>> -
>>> -struct demotion_nodes {
>>> -	unsigned short nr;
>>> -	short nodes[DEMOTION_TARGET_NODES];
>>> -};
>>> -
>>> -static struct demotion_nodes *node_demotion __read_mostly;
>>> -
>>> -/**
>>> - * next_demotion_node() - Get the next node in the demotion path
>>> - * @node: The starting node to lookup the next node
>>> - *
>>> - * Return: node id for next memory node in the demotion path hierarchy
>>> - * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
>>> - * @node online or guarantee that it *continues* to be the next demotion
>>> - * target.
>>> - */
>>> -int next_demotion_node(int node)
>>> -{
>>> -	struct demotion_nodes *nd;
>>> -	unsigned short target_nr, index;
>>> -	int target;
>>> -
>>> -	if (!node_demotion)
>>> -		return NUMA_NO_NODE;
>>> -
>>> -	nd = &node_demotion[node];
>>> -
>>> -	/*
>>> -	 * node_demotion[] is updated without excluding this
>>> -	 * function from running.  RCU doesn't provide any
>>> -	 * compiler barriers, so the READ_ONCE() is required
>>> -	 * to avoid compiler reordering or read merging.
>>> -	 *
>>> -	 * Make sure to use RCU over entire code blocks if
>>> -	 * node_demotion[] reads need to be consistent.
>>> -	 */
>>> -	rcu_read_lock();
>>> -	target_nr = READ_ONCE(nd->nr);
>>> -
>>> -	switch (target_nr) {
>>> -	case 0:
>>> -		target = NUMA_NO_NODE;
>>> -		goto out;
>>> -	case 1:
>>> -		index = 0;
>>> -		break;
>>> -	default:
>>> -		/*
>>> -		 * If there are multiple target nodes, just select one
>>> -		 * target node randomly.
>>> -		 *
>>> -		 * In addition, we can also use round-robin to select
>>> -		 * target node, but we should introduce another variable
>>> -		 * for node_demotion[] to record last selected target node,
>>> -		 * that may cause cache ping-pong due to the changing of
>>> -		 * last target node. Or introducing per-cpu data to avoid
>>> -		 * caching issue, which seems more complicated. So selecting
>>> -		 * target node randomly seems better until now.
>>> -		 */
>>> -		index = get_random_int() % target_nr;
>>> -		break;
>>> -	}
>>> -
>>> -	target = READ_ONCE(nd->nodes[index]);
>>> -
>>> -out:
>>> -	rcu_read_unlock();
>>> -	return target;
>>> -}
>>> -
>>> -/* Disable reclaim-based migration. */
>>> -static void __disable_all_migrate_targets(void)
>>> -{
>>> -	int node, i;
>>> -
>>> -	if (!node_demotion)
>>> -		return;
>>> -
>>> -	for_each_online_node(node) {
>>> -		node_demotion[node].nr = 0;
>>> -		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
>>> -			node_demotion[node].nodes[i] = NUMA_NO_NODE;
>>> -	}
>>> -}
>>> -
>>> -static void disable_all_migrate_targets(void)
>>> -{
>>> -	__disable_all_migrate_targets();
>>> -
>>> -	/*
>>> -	 * Ensure that the "disable" is visible across the system.
>>> -	 * Readers will see either a combination of before+disable
>>> -	 * state or disable+after.  They will never see before and
>>> -	 * after state together.
>>> -	 *
>>> -	 * The before+after state together might have cycles and
>>> -	 * could cause readers to do things like loop until this
>>> -	 * function finishes.  This ensures they can only see a
>>> -	 * single "bad" read and would, for instance, only loop
>>> -	 * once.
>>> -	 */
>>> -	synchronize_rcu();
>>> -}
>>> -
>>> -/*
>>> - * Find an automatic demotion target for 'node'.
>>> - * Failing here is OK.  It might just indicate
>>> - * being at the end of a chain.
>>> - */
>>> -static int establish_migrate_target(int node, nodemask_t *used,
>>> -				    int best_distance)
>>> -{
>>> -	int migration_target, index, val;
>>> -	struct demotion_nodes *nd;
>>> -
>>> -	if (!node_demotion)
>>> -		return NUMA_NO_NODE;
>>> -
>>> -	nd = &node_demotion[node];
>>> -
>>> -	migration_target = find_next_best_node(node, used);
>>> -	if (migration_target == NUMA_NO_NODE)
>>> -		return NUMA_NO_NODE;
>>> -
>>> -	/*
>>> -	 * If the node has been set a migration target node before,
>>> -	 * which means it's the best distance between them. Still
>>> -	 * check if this node can be demoted to other target nodes
>>> -	 * if they have a same best distance.
>>> -	 */
>>> -	if (best_distance != -1) {
>>> -		val = node_distance(node, migration_target);
>>> -		if (val > best_distance)
>>> -			goto out_clear;
>>> -	}
>>> -
>>> -	index = nd->nr;
>>> -	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
>>> -		      "Exceeds maximum demotion target nodes\n"))
>>> -		goto out_clear;
>>> -
>>> -	nd->nodes[index] = migration_target;
>>> -	nd->nr++;
>>> -
>>> -	return migration_target;
>>> -out_clear:
>>> -	node_clear(migration_target, *used);
>>> -	return NUMA_NO_NODE;
>>> -}
>>> -
>>> -/*
>>> - * When memory fills up on a node, memory contents can be
>>> - * automatically migrated to another node instead of
>>> - * discarded at reclaim.
>>> - *
>>> - * Establish a "migration path" which will start at nodes
>>> - * with CPUs and will follow the priorities used to build the
>>> - * page allocator zonelists.
>>> - *
>>> - * The difference here is that cycles must be avoided.  If
>>> - * node0 migrates to node1, then neither node1, nor anything
>>> - * node1 migrates to can migrate to node0. Also one node can
>>> - * be migrated to multiple nodes if the target nodes all have
>>> - * a same best-distance against the source node.
>>> - *
>>> - * This function can run simultaneously with readers of
>>> - * node_demotion[].  However, it can not run simultaneously
>>> - * with itself.  Exclusion is provided by memory hotplug events
>>> - * being single-threaded.
>>> - */
>>> -static void __set_migration_target_nodes(void)
>>> -{
>>> -	nodemask_t next_pass;
>>> -	nodemask_t this_pass;
>>> -	nodemask_t used_targets = NODE_MASK_NONE;
>>> -	int node, best_distance;
>>> -
>>> -	/*
>>> -	 * Avoid any oddities like cycles that could occur
>>> -	 * from changes in the topology.  This will leave
>>> -	 * a momentary gap when migration is disabled.
>>> -	 */
>>> -	disable_all_migrate_targets();
>>> -
>>> -	/*
>>> -	 * Allocations go close to CPUs, first.  Assume that
>>> -	 * the migration path starts at the nodes with CPUs.
>>> -	 */
>>> -	next_pass = node_states[N_CPU];
>>> -again:
>>> -	this_pass = next_pass;
>>> -	next_pass = NODE_MASK_NONE;
>>> -	/*
>>> -	 * To avoid cycles in the migration "graph", ensure
>>> -	 * that migration sources are not future targets by
>>> -	 * setting them in 'used_targets'.  Do this only
>>> -	 * once per pass so that multiple source nodes can
>>> -	 * share a target node.
>>> -	 *
>>> -	 * 'used_targets' will become unavailable in future
>>> -	 * passes.  This limits some opportunities for
>>> -	 * multiple source nodes to share a destination.
>>> -	 */
>>> -	nodes_or(used_targets, used_targets, this_pass);
>>> -
>>> -	for_each_node_mask(node, this_pass) {
>>> -		best_distance = -1;
>>> -
>>> -		/*
>>> -		 * Try to set up the migration path for the node, and the target
>>> -		 * migration nodes can be multiple, so doing a loop to find all
>>> -		 * the target nodes if they all have a best node distance.
>>> -		 */
>>> -		do {
>>> -			int target_node =
>>> -				establish_migrate_target(node, &used_targets,
>>> -							 best_distance);
>>> -
>>> -			if (target_node == NUMA_NO_NODE)
>>> -				break;
>>> -
>>> -			if (best_distance == -1)
>>> -				best_distance = node_distance(node, target_node);
>>> -
>>> -			/*
>>> -			 * Visit targets from this pass in the next pass.
>>> -			 * Eventually, every node will have been part of
>>> -			 * a pass, and will become set in 'used_targets'.
>>> -			 */
>>> -			node_set(target_node, next_pass);
>>> -		} while (1);
>>> -	}
>>> -	/*
>>> -	 * 'next_pass' contains nodes which became migration
>>> -	 * targets in this pass.  Make additional passes until
>>> -	 * no more migrations targets are available.
>>> -	 */
>>> -	if (!nodes_empty(next_pass))
>>> -		goto again;
>>> -}
>>> -
>>> -/*
>>> - * For callers that do not hold get_online_mems() already.
>>> - */
>>> -void set_migration_target_nodes(void)
>>> -{
>>> -	get_online_mems();
>>> -	__set_migration_target_nodes();
>>> -	put_online_mems();
>>> -}
>>> -
>>> -/*
>>> - * This leaves migrate-on-reclaim transiently disabled between
>>> - * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
>>> - * whether reclaim-based migration is enabled or not, which
>>> - * ensures that the user can turn reclaim-based migration at
>>> - * any time without needing to recalculate migration targets.
>>> - *
>>> - * These callbacks already hold get_online_mems().  That is why
>>> - * __set_migration_target_nodes() can be used as opposed to
>>> - * set_migration_target_nodes().
>>> - */
>>> -#ifdef CONFIG_MEMORY_HOTPLUG
>>> -static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>>> -						 unsigned long action, void *_arg)
>>> -{
>>> -	struct memory_notify *arg = _arg;
>>> -
>>> -	/*
>>> -	 * Only update the node migration order when a node is
>>> -	 * changing status, like online->offline.  This avoids
>>> -	 * the overhead of synchronize_rcu() in most cases.
>>> -	 */
>>> -	if (arg->status_change_nid < 0)
>>> -		return notifier_from_errno(0);
>>> -
>>> -	switch (action) {
>>> -	case MEM_GOING_OFFLINE:
>>> -		/*
>>> -		 * Make sure there are not transient states where
>>> -		 * an offline node is a migration target.  This
>>> -		 * will leave migration disabled until the offline
>>> -		 * completes and the MEM_OFFLINE case below runs.
>>> -		 */
>>> -		disable_all_migrate_targets();
>>> -		break;
>>> -	case MEM_OFFLINE:
>>> -	case MEM_ONLINE:
>>> -		/*
>>> -		 * Recalculate the target nodes once the node
>>> -		 * reaches its final state (online or offline).
>>> -		 */
>>> -		__set_migration_target_nodes();
>>> -		break;
>>> -	case MEM_CANCEL_OFFLINE:
>>> -		/*
>>> -		 * MEM_GOING_OFFLINE disabled all the migration
>>> -		 * targets.  Reenable them.
>>> -		 */
>>> -		__set_migration_target_nodes();
>>> -		break;
>>> -	case MEM_GOING_ONLINE:
>>> -	case MEM_CANCEL_ONLINE:
>>> -		break;
>>> -	}
>>> -
>>> -	return notifier_from_errno(0);
>>> -}
>>> -#endif
>>> -
>>> -void __init migrate_on_reclaim_init(void)
>>> -{
>>> -	node_demotion = kcalloc(nr_node_ids,
>>> -				sizeof(struct demotion_nodes),
>>> -				GFP_KERNEL);
>>> -	WARN_ON(!node_demotion);
>>> -#ifdef CONFIG_MEMORY_HOTPLUG
>>> -	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
>>> -#endif
>>> -	/*
>>> -	 * At this point, all numa nodes with memory/CPus have their state
>>> -	 * properly set, so we can build the demotion order now.
>>> -	 * Let us hold the cpu_hotplug lock just, as we could possibily have
>>> -	 * CPU hotplug events during boot.
>>> -	 */
>>> -	cpus_read_lock();
>>> -	set_migration_target_nodes();
>>> -	cpus_read_unlock();
>>> -}
>>>  #endif /* CONFIG_NUMA */
>>> -
>>> -
>>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>>> index 373d2730fcf2..35c6ff97cf29 100644
>>> --- a/mm/vmstat.c
>>> +++ b/mm/vmstat.c
>>> @@ -28,7 +28,6 @@
>>>  #include <linux/mm_inline.h>
>>>  #include <linux/page_ext.h>
>>>  #include <linux/page_owner.h>
>>> -#include <linux/migrate.h>
>>>  
>>>  #include "internal.h"
>>>  
>>> @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu)
>>>  
>>>  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
>>>  		node_set_state(cpu_to_node(cpu), N_CPU);
>>> -		set_migration_target_nodes();
>>>  	}
>> 
>> "{" and "}" can be removed too?
>> 
>
> I will update  the patch incorporating other changes you suggested.

Thanks!

Best Regards,
Huang, Ying


* Re: [PATCH v10 3/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined
  2022-07-26 12:03     ` Aneesh Kumar K V
@ 2022-07-27  1:53       ` Huang, Ying
  2022-07-27  4:38         ` Aneesh Kumar K.V
  0 siblings, 1 reply; 39+ messages in thread
From: Huang, Ying @ 2022-07-27  1:53 UTC (permalink / raw)
  To: Aneesh Kumar K V, Alistair Popple
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Dan Williams, Johannes Weiner,
	jvgediya.oss

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 7/26/22 9:33 AM, Huang, Ying wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> If the new NUMA node onlined doesn't have a performance level assigned,
>>> the kernel adds the NUMA node to default memory tier.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>>  include/linux/memory-tiers.h |  1 +
>>>  mm/memory-tiers.c            | 75 ++++++++++++++++++++++++++++++++++++
>>>  2 files changed, 76 insertions(+)
>>>
>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>> index ef380a39db3a..3d5f14d57ae6 100644
>>> --- a/include/linux/memory-tiers.h
>>> +++ b/include/linux/memory-tiers.h
>>> @@ -14,6 +14,7 @@
>>>  #define MEMTIER_PERF_LEVEL_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
>>>  /* leave one tier below this slow pmem */
>>>  #define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)
>>> +#define MEMTIER_HOTPLUG_PRIO	100
>>>  
>>>  extern bool numa_demotion_enabled;
>>>  
>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>> index 41a21cc5ae55..cc3a47ec18e4 100644
>>> --- a/mm/memory-tiers.c
>>> +++ b/mm/memory-tiers.c
>>> @@ -5,6 +5,7 @@
>>>  #include <linux/lockdep.h>
>>>  #include <linux/moduleparam.h>
>>>  #include <linux/node.h>
>>> +#include <linux/memory.h>
>>>  #include <linux/memory-tiers.h>
>>>  
>>>  struct memory_tier {
>>> @@ -64,6 +65,78 @@ static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
>>>  	return new_memtier;
>>>  }
>>>  
>>> +static struct memory_tier *__node_get_memory_tier(int node)
>>> +{
>>> +	struct memory_tier *memtier;
>>> +
>>> +	list_for_each_entry(memtier, &memory_tiers, list) {
>>> +		if (node_isset(node, memtier->nodelist))
>>> +			return memtier;
>>> +	}
>>> +	return NULL;
>>> +}
>>> +
>>> +static void init_node_memory_tier(int node)
>> 
>> set_node_memory_tier()?
>
> That was done based on feedback from Alistair 
>
> https://lore.kernel.org/linux-mm/87h73iapg1.fsf@nvdebian.thelocal
>
>> 
>>> +{
>>> +	int perf_level;
>>> +	struct memory_tier *memtier;
>>> +
>>> +	mutex_lock(&memory_tier_lock);
>>> +
>>> +	memtier = __node_get_memory_tier(node);
>>> +	if (!memtier) {
>>> +		perf_level = node_devices[node]->perf_level;
>>> +		memtier = find_create_memory_tier(perf_level);
>>> +		node_set(node, memtier->nodelist);
>>> +	}

It's related to Alistair's comments too.  When will memtier != NULL
here?  Maybe we just need a VM_WARN_ON() here?

>>> +	mutex_unlock(&memory_tier_lock);
>>> +}
>>> +
>>> +static void clear_node_memory_tier(int node)
>>> +{
>>> +	struct memory_tier *memtier;
>>> +
>>> +	mutex_lock(&memory_tier_lock);
>>> +	memtier = __node_get_memory_tier(node);
>>> +	if (memtier)
>>> +		node_clear(node, memtier->nodelist);
>> 
>> When memtier->nodelist becomes empty, we need to free memtier?
>> 
>>> +	mutex_unlock(&memory_tier_lock);
>>> +}
>>> +
>>> +/*
>>> + * This runs whether reclaim-based migration is enabled or not,
>>> + * which ensures that the user can turn reclaim-based migration
>>> + * at any time without needing to recalculate migration targets.
>>> + */
>> 
>> The comments doesn't apply here.
>> 
>>> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>>> +						 unsigned long action, void *_arg)
>> 
>> Now we are building memory tiers instead of working on demotion.  So I
>> think we should rename the function to memtier_hotplug_callback().
>> 
>>> +{
>>> +	struct memory_notify *arg = _arg;
>>> +
>>> +	/*
>>> +	 * Only update the node migration order when a node is
>>> +	 * changing status, like online->offline.
>>> +	 */
>>> +	if (arg->status_change_nid < 0)
>>> +		return notifier_from_errno(0);
>>> +
>>> +	switch (action) {
>>> +	case MEM_OFFLINE:
>>> +		clear_node_memory_tier(arg->status_change_nid);
>>> +		break;
>>> +	case MEM_ONLINE:
>>> +		init_node_memory_tier(arg->status_change_nid);
>>> +		break;
>>> +	}
>>> +
>>> +	return notifier_from_errno(0);
>>> +}
>>> +
>>> +static void __init migrate_on_reclaim_init(void)
>>> +{
>>> +	hotplug_memory_notifier(migrate_on_reclaim_callback, MEMTIER_HOTPLUG_PRIO);
>>> +}
>> 
>> I suggest to call hotplug_memory_notifier() in memory_tier_init()
>> directly.  We are not working on demotion here.
>> 
>>> +
>>>  static int __init memory_tier_init(void)
>>>  {
>>>  	int node;
>>> @@ -96,6 +169,8 @@ static int __init memory_tier_init(void)
>>>  			node_property->perf_level = default_memtier_perf_level;
>>>  	}
>>>  	mutex_unlock(&memory_tier_lock);
>>> +
>>> +	migrate_on_reclaim_init();
>>>  	return 0;
>>>  }
>>>  subsys_initcall(memory_tier_init);
>> 
>> Best Regards,
>> Huang, Ying
>
>
> Will update the patch in next iteration to take care of other feedback.

Thanks!

Best Regards,
Huang, Ying


* Re: [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM
  2022-07-26  2:13           ` Huang, Ying
@ 2022-07-27  4:31             ` Aneesh Kumar K.V
  2022-07-28  6:39               ` Huang, Ying
  0 siblings, 1 reply; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-27  4:31 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Huang, Ying" <ying.huang@intel.com> writes:

> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>
>> On 7/25/22 2:05 PM, Huang, Ying wrote:
>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>> 
>>>> On 7/25/22 12:07 PM, Huang, Ying wrote:
>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>
>>>>>> By default, all nodes are assigned to the default memory tier which
>>>>>> is the memory tier designated for nodes with DRAM
>>>>>>
>>>>>> Set dax kmem device node's tier to slower memory tier by assigning
>>>>>> performance level to MEMTIER_PERF_LEVEL_PMEM. PMEM tier
>>>>>> appears below the default memory tier in demotion order.
>>>>>>
>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>> ---
>>>>>>  arch/powerpc/platforms/pseries/papr_scm.c | 41 ++++++++++++++++++++---
>>>>>>  drivers/acpi/nfit/core.c                  | 41 ++++++++++++++++++++++-
>>>>>>  2 files changed, 76 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
>>>>>> index 82cae08976bc..3b6164418d6f 100644
>>>>>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>>>>>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>>>>>> @@ -14,6 +14,8 @@
>>>>>>  #include <linux/delay.h>
>>>>>>  #include <linux/seq_buf.h>
>>>>>>  #include <linux/nd.h>
>>>>>> +#include <linux/memory.h>
>>>>>> +#include <linux/memory-tiers.h>
>>>>>>  
>>>>>>  #include <asm/plpar_wrappers.h>
>>>>>>  #include <asm/papr_pdsm.h>
>>>>>> @@ -98,6 +100,7 @@ struct papr_scm_priv {
>>>>>>  	bool hcall_flush_required;
>>>>>>  
>>>>>>  	uint64_t bound_addr;
>>>>>> +	int target_node;
>>>>>>  
>>>>>>  	struct nvdimm_bus_descriptor bus_desc;
>>>>>>  	struct nvdimm_bus *bus;
>>>>>> @@ -1278,6 +1281,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>>>>>  	p->bus_desc.module = THIS_MODULE;
>>>>>>  	p->bus_desc.of_node = p->pdev->dev.of_node;
>>>>>>  	p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
>>>>>> +	p->target_node = dev_to_node(&p->pdev->dev);
>>>>>>  
>>>>>>  	/* Set the dimm command family mask to accept PDSMs */
>>>>>>  	set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask);
>>>>>> @@ -1322,7 +1326,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>>>>>  	mapping.size = p->blocks * p->block_size; // XXX: potential overflow?
>>>>>>  
>>>>>>  	memset(&ndr_desc, 0, sizeof(ndr_desc));
>>>>>> -	target_nid = dev_to_node(&p->pdev->dev);
>>>>>> +	target_nid = p->target_node;
>>>>>>  	online_nid = numa_map_to_online_node(target_nid);
>>>>>>  	ndr_desc.numa_node = online_nid;
>>>>>>  	ndr_desc.target_node = target_nid;
>>>>>> @@ -1582,15 +1586,42 @@ static struct platform_driver papr_scm_driver = {
>>>>>>  	},
>>>>>>  };
>>>>>>  
>>>>>> +static int papr_scm_callback(struct notifier_block *self,
>>>>>> +			     unsigned long action, void *arg)
>>>>>> +{
>>>>>> +	struct memory_notify *mnb = arg;
>>>>>> +	int nid = mnb->status_change_nid;
>>>>>> +	struct papr_scm_priv *p;
>>>>>> +
>>>>>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>>>>>> +		return NOTIFY_OK;
>>>>>> +
>>>>>> +	mutex_lock(&papr_ndr_lock);
>>>>>> +	list_for_each_entry(p, &papr_nd_regions, region_list) {
>>>>>> +		if (p->target_node == nid) {
>>>>>> +			node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>>>>>> +			break;
>>>>>> +		}
>>>>>> +	}
>>>>>> +
>>>>>> +	mutex_unlock(&papr_ndr_lock);
>>>>>> +	return NOTIFY_OK;
>>>>>> +}
>>>>>> +
>>>>>>  static int __init papr_scm_init(void)
>>>>>>  {
>>>>>>  	int ret;
>>>>>>  
>>>>>>  	ret = platform_driver_register(&papr_scm_driver);
>>>>>> -	if (!ret)
>>>>>> -		mce_register_notifier(&mce_ue_nb);
>>>>>> -
>>>>>> -	return ret;
>>>>>> +	if (ret)
>>>>>> +		return ret;
>>>>>> +	mce_register_notifier(&mce_ue_nb);
>>>>>> +	/*
>>>>>> +	 * register a memory hotplug notifier at prio 2 so that we
>>>>>> +	 * can update the perf level for the node.
>>>>>> +	 */
>>>>>> +	hotplug_memory_notifier(papr_scm_callback, MEMTIER_HOTPLUG_PRIO + 1);
>>>>>> +	return 0;
>>>>>>  }
>>>>>>  module_init(papr_scm_init);
>>>>>>  
>>>>>> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
>>>>>> index ae5f4acf2675..7ea1017ef790 100644
>>>>>> --- a/drivers/acpi/nfit/core.c
>>>>>> +++ b/drivers/acpi/nfit/core.c
>>>>>> @@ -15,6 +15,8 @@
>>>>>>  #include <linux/sort.h>
>>>>>>  #include <linux/io.h>
>>>>>>  #include <linux/nd.h>
>>>>>> +#include <linux/memory.h>
>>>>>> +#include <linux/memory-tiers.h>
>>>>>>  #include <asm/cacheflush.h>
>>>>>>  #include <acpi/nfit.h>
>>>>>>  #include "intel.h"
>>>>>> @@ -3470,6 +3472,39 @@ static struct acpi_driver acpi_nfit_driver = {
>>>>>>  	},
>>>>>>  };
>>>>>>  
>>>>>> +static int nfit_callback(struct notifier_block *self,
>>>>>> +			 unsigned long action, void *arg)
>>>>>> +{
>>>>>> +	bool found = false;
>>>>>> +	struct memory_notify *mnb = arg;
>>>>>> +	int nid = mnb->status_change_nid;
>>>>>> +	struct nfit_spa *nfit_spa;
>>>>>> +	struct acpi_nfit_desc *acpi_desc;
>>>>>> +
>>>>>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>>>>>> +		return NOTIFY_OK;
>>>>>> +
>>>>>> +	mutex_lock(&acpi_desc_lock);
>>>>>> +	list_for_each_entry(acpi_desc, &acpi_descs, list) {
>>>>>> +		mutex_lock(&acpi_desc->init_mutex);
>>>>>> +		list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
>>>>>> +			struct acpi_nfit_system_address *spa = nfit_spa->spa;
>>>>>> +			int target_node = pxm_to_node(spa->proximity_domain);
>>>>>> +
>>>>>> +			if (target_node == nid) {
>>>>>> +				node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>>>>>> +				found = true;
>>>>>> +				break;
>>>>>> +			}
>>>>>> +		}
>>>>>> +		mutex_unlock(&acpi_desc->init_mutex);
>>>>>> +		if (found)
>>>>>> +			break;
>>>>>> +	}
>>>>>> +	mutex_unlock(&acpi_desc_lock);
>>>>>> +	return NOTIFY_OK;
>>>>>> +}
>>>>>> +
>>>>>>  static __init int nfit_init(void)
>>>>>>  {
>>>>>>  	int ret;
>>>>>> @@ -3509,7 +3544,11 @@ static __init int nfit_init(void)
>>>>>>  		nfit_mce_unregister();
>>>>>>  		destroy_workqueue(nfit_wq);
>>>>>>  	}
>>>>>> -
>>>>>> +	/*
>>>>>> +	 * register a memory hotplug notifier at prio 2 so that we
>>>>>> +	 * can update the perf level for the node.
>>>>>> +	 */
>>>>>> +	hotplug_memory_notifier(nfit_callback, MEMTIER_HOTPLUG_PRIO + 1);
>>>>>>  	return ret;
>>>>>>  
>>>>>>  }
>>>>>
>>>>> I don't think that it's a good idea to set perf_level of a memory device
>>>>> (node) via NFIT only.
>>>>
>>>>>
>>>>> For example, we may prefer HMAT over NFIT when it's available.  So the
>>>>> perf_level should be set in dax/kmem.c based on information provided by
>>>>> ACPI or other information sources.  ACPI can provide some functions/data
>>>>> structures to let drivers (like dax/kmem.c) to query the properties of
>>>>> the memory device (node).
>>>>>
>>>>
>>>> I was trying to make it architecture specific so that we have a placeholder
>>>> to fine-tune this better. For example, ppc64 will look at device tree
>>>> details to find the performance level and x86 will look at ACPI data structure.
>>>> Adding that hotplug callback in dax/kmem will prevent that architecture-specific
>>>> customization? 
>>>>
>>>> I would expect that callback to move to the generic ACPI layer so that even
>>>> firmware managed CXL devices can be added to a lower tier?  I don't understand
>>>> ACPI enough to find the right abstraction for that hotplug callback. 
>>> 
>>> I'm OK for this to be architecture specific.
>>> 
>>> But ACPI NFIT isn't enough for x86.  For example, PMEM can be added to a
>>> virtual machine as normal memory nodes without NFIT.  Instead, PMEM is
>>> marked via "memmap=<nn>G!<ss>G" or "efi_fake_mem=<nn>G@<ss>G:0x40000",
>>> and dax/kmem.c is used to hot-add the memory.
>>> 
>>> So, before a more sophisticated version is implemented for x86.  The
>>> simplest version as I suggested below works even better.
>>> 
>>>>> As the simplest first version, this can be just hard coded.
>>>>>
>>>>
>>>> If you are suggesting to not use hotplug callback, one of the challenge was node_devices[nid]
>>>> get allocated pretty late when we try to online the node. 
>>> 
>>> As the simplest first version, this can be as simple as,
>>> 
>>> /* dax/kmem.c */
>>> static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>> {
>>> 	node_devices[dev_dax->target_node]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>>> 	/* add_memory_driver_managed() */
>>> }
>>> 
>>> To be compatible with ppc64 version, how about make dev_dax_kmem_probe()
>>> set perf_level only if it's uninitialized?
>>
>> That will result in kernel crash because node_devices[dev_dax->target_node] is not initialized there. 
>>
>> it get allocated in add_memory_resource -> __try_online_node ->
>> register_one_node -> __register_one_node -> node_devices[nid] =
>> kzalloc(sizeof(struct node), GFP_KERNEL);
>
> Ah, right!  So we need some other way to do that, for example, a global
> array as follows,
>
>   int node_perf_levels[MAX_NUMNODES];

This would be much simpler than adding memory_type, but it is a
memory device property, so it would be a good idea to group
it together with the other properties in node_devices[]. Could we look at
allocating node_devices[] for dax/kmem nodes from within the driver?

I did implement memory_type, and it does bring some additional complexity,
though it simplifies the interface.

I was looking at having the platform drivers call
struct memory_dev_type *register_memory_type(int perf_level, int node)
to register a new memory_type. If dax/kmem doesn't find a memory_dev_type
registered for the NUMA node, it will assign the default pmem type.

	if (!node_memory_types[numa_node]) {
		/*
		 * low level drivers didn't initialize the memory type.
		 * assign a default level.
		 */
		node_memory_types[numa_node] = &default_pmem_type;
		node_set(numa_node, default_pmem_type.nodes);
	}

This should allow ACPI or papr_scm to fine-tune the memory type of
the device they are initializing.
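
As a rough userspace model of that ordering (a sketch only, not the
actual patch): set_node_memory_type() and kmem_node_memory_type() are
hypothetical helper names, NR_NODES stands in for MAX_NUMNODES, an
unsigned long stands in for nodemask_t, and the perf_level value is
arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 8	/* hypothetical; stands in for MAX_NUMNODES */

struct memory_dev_type {
	int perf_level;
	unsigned long nodes;	/* bitmask standing in for nodemask_t */
};

/* The perf_level value here is arbitrary, chosen only for the sketch. */
static struct memory_dev_type default_pmem_type = { .perf_level = 100 };
static struct memory_dev_type *node_memory_types[NR_NODES];

/* What a platform driver (ACPI, papr_scm) would do before dax/kmem runs. */
static int set_node_memory_type(int node, struct memory_dev_type *type)
{
	if (node < 0 || node >= NR_NODES || !type)
		return -1;
	/* Drop the node from its previous type, if it had one. */
	if (node_memory_types[node])
		node_memory_types[node]->nodes &= ~(1UL << node);
	node_memory_types[node] = type;
	type->nodes |= 1UL << node;
	return 0;
}

/* What dax/kmem would do: keep the driver's choice, else use the default. */
static struct memory_dev_type *kmem_node_memory_type(int node)
{
	if (node < 0 || node >= NR_NODES)
		return NULL;
	if (!node_memory_types[node])
		set_node_memory_type(node, &default_pmem_type);
	assert(node_memory_types[node]);
	return node_memory_types[node];
}
```

A node that a low-level driver has already typed keeps that type; only
untyped nodes fall back to default_pmem_type.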

>
> And, I think that we need to consider the memory type here too.  As
> suggested by Johannes, memory type describes a set of memory devices
> (NUMA nodes) with same performance character (that is, abstract distance
> or perf level).  The data structure can be something as follows,
>
>   struct memory_type {
>         int perf_level;
>         struct list_head tier_sibling;
>         nodemask_t nodes;
>   };

memory_type is already used in include/linux/memremap.h:

enum memory_type

How about struct memory_dev_type?

How about we also add a memtier field here that is only accessed with
memory_tier_lock held? That will allow easy access to the memory
tier this type belongs to.
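
Roughly, and only as a userspace sketch: the memtier back-pointer turns
the node-to-tier lookup into two pointer hops. memory_type_set_tier()
and node_to_memtier() are hypothetical names, NR_NODES stands in for
MAX_NUMNODES, and the sketch does no locking at all, whereas the kernel
version would take memory_tier_lock around these accesses.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 8	/* hypothetical; stands in for MAX_NUMNODES */

struct memory_tier {
	int perf_level_start;
};

struct memory_dev_type {
	int perf_level;
	unsigned long nodes;		/* bitmask standing in for nodemask_t */
	struct memory_tier *memtier;	/* kernel: only under memory_tier_lock */
};

static struct memory_dev_type *node_memory_types[NR_NODES];

/* Record which tier a memory type currently belongs to. */
static void memory_type_set_tier(struct memory_dev_type *type,
				 struct memory_tier *tier)
{
	assert(type && tier);
	type->memtier = tier;
}

/* node -> memory type -> memory tier; NULL if either hop is missing. */
static struct memory_tier *node_to_memtier(int node)
{
	struct memory_dev_type *type;

	if (node < 0 || node >= NR_NODES)
		return NULL;
	type = node_memory_types[node];
	return type ? type->memtier : NULL;
}
```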

>
> The memory_type should be created and managed by the device drivers
> (e.g., dax/kmem) which manages the memory devices.  In the future, we
> will represent it in sysfs, and a per-memory_type knob will be provided
> to offset the perf_level of all memory devices managed by the
> memory_type.
>
> With memory_type, the memory_tier becomes,
>
>   struct memory_tier {
>         int perf_level_start;
>         struct list_head sibling;
>         struct list_head memory_types;
>   };
>
> And we need an array to map from nid to memory_type, e.g., as follows,
>
>   struct memory_type *node_memory_types[MAX_NUMNODES];

Changing the perf level of a memory device will move it from one
memory type to another, and such a move should also result in
updating the node's memory tier, i.e., it will be something like below:

static void update_node_perf_level(int node, struct memory_dev_type *memtype)
{
	pg_data_t *pgdat;
	struct memory_dev_type *current_memtype;
	struct memory_tier *memtier;

	pgdat = NODE_DATA(node);
	if (!pgdat)
		return;

	mutex_lock(&memory_tier_lock);
	/*
	 * Make sure we mark the memtier NULL before we assign the new memory tier
	 * to the NUMA node. This makes sure that anybody looking at NODE_DATA
	 * finds either a NULL memtier or one that is still valid.
	 */
	rcu_assign_pointer(pgdat->memtier, NULL);
	synchronize_rcu();

	if (!memtype->memtier) {
		memtier = find_create_memory_tier(memtype);
		if (IS_ERR(memtier))
			goto err_out;
	}
	current_memtype = node_memory_types[node];
	node_clear(node, current_memtype->nodes);
	/*
	 * If current memtype becomes empty, we should kill the memory tiers
	 */
	node_set(node, memtype->nodes);
	node_memory_types[node] = memtype;
	rcu_assign_pointer(pgdat->memtier, memtype->memtier);
	establish_migration_targets();
err_out:
	mutex_unlock(&memory_tier_lock);
}


>
> We need to manage the memory_type in device drivers, instead of ACPI or
> device tree callbacks.
>
> Because memory_type is an important part of the explicit memory tier
> implementation and may influence the design, I suggest to include it in
> the implementation now.  It appears not too complex anyway.
>

One thing I observed while implementing this is the additional
complexity while walking the memory tiers. Any tier-related operation
impacting memory NUMA nodes now becomes a walk of two linked lists, as below.

For example:
list_for_each_entry(memtier, &memory_tiers, list) {
	list_for_each_entry(memtype, &memtier->memory_types, tier_sibling)
		nodes_or(nodes, nodes, memtype->nodes);
}

As we offline N_MEMORY nodes, we now have to do:

	memtype = node_memory_types[node];
	if (nodes_empty(memtype->nodes)) {
		list_del(&memtype->tier_sibling);
		if (list_empty(&current_memtier->memory_types))
			destroy_memory_tier(current_memtier);
	}
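
A self-contained user-space sketch of the two-level walk and the offline
teardown may make the extra bookkeeping easier to see. It rests on
assumptions: plain singly linked lists replace the kernel's list_head, an
unsigned long replaces nodemask_t, and all_tier_nodes() and offline_node()
are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct memory_dev_type {
	unsigned long nodes;			/* bitmask of member nodes */
	struct memory_dev_type *next;		/* next type in the same tier */
};

struct memory_tier {
	struct memory_dev_type *types;		/* types in this tier */
	struct memory_tier *next;		/* next tier */
};

/* Union of the node masks of every type in every tier: two nested walks. */
static unsigned long all_tier_nodes(const struct memory_tier *tiers)
{
	unsigned long nodes = 0;
	const struct memory_tier *tier;
	const struct memory_dev_type *type;

	for (tier = tiers; tier; tier = tier->next)
		for (type = tier->types; type; type = type->next)
			nodes |= type->nodes;
	return nodes;
}

/*
 * Drop a node from its type. Unlink the type once it is empty and
 * return 1 if the tier is left with no types, so that the caller
 * can destroy the tier; return 0 otherwise.
 */
static int offline_node(struct memory_tier *tier,
			struct memory_dev_type *type, int node)
{
	struct memory_dev_type **link;

	assert(node >= 0 && node < (int)(8 * sizeof(type->nodes)));
	type->nodes &= ~(1UL << node);
	if (type->nodes)
		return 0;

	for (link = &tier->types; *link; link = &(*link)->next) {
		if (*link == type) {
			*link = type->next;
			type->next = NULL;
			break;
		}
	}
	return tier->types == NULL;
}
```

Compared with a per-tier nodemask, every query now visits each type of each
tier, and removing a node can cascade into removing a type and then a tier.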

-aneesh


* Re: [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-27  1:40       ` Huang, Ying
@ 2022-07-27  4:35         ` Aneesh Kumar K.V
  2022-07-28  6:51           ` Huang, Ying
  2022-08-03  3:18         ` Aneesh Kumar K.V
  1 sibling, 1 reply; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-27  4:35 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Huang, Ying" <ying.huang@intel.com> writes:

> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>
>> On 7/26/22 1:14 PM, Huang, Ying wrote:
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>> 

....

>>> + */
>>>> +int next_demotion_node(int node)
>>>> +{
>>>> +	struct demotion_nodes *nd;
>>>> +	int target;
>>>> +
>>>> +	if (!node_demotion)
>>>> +		return NUMA_NO_NODE;
>>>> +
>>>> +	nd = &node_demotion[node];
>>>> +
>>>> +	/*
>>>> +	 * node_demotion[] is updated without excluding this
>>>> +	 * function from running.
>>>> +	 *
>>>> +	 * Make sure to use RCU over entire code blocks if
>>>> +	 * node_demotion[] reads need to be consistent.
>>>> +	 */
>>>> +	rcu_read_lock();
>>>> +	/*
>>>> +	 * If there are multiple target nodes, just select one
>>>> +	 * target node randomly.
>>>> +	 *
>>>> +	 * In addition, we can also use round-robin to select
>>>> +	 * target node, but we should introduce another variable
>>>> +	 * for node_demotion[] to record last selected target node,
>>>> +	 * that may cause cache ping-pong due to the changing of
>>>> +	 * last target node. Or introducing per-cpu data to avoid
>>>> +	 * caching issue, which seems more complicated. So selecting
>>>> +	 * target node randomly seems better until now.
>>>> +	 */
>>>> +	target = node_random(&nd->preferred);
>>> 
>>> In one of the most common cases, nodes_weight(&nd->preferred) == 1.
>>> Where, get_random_int() in node_random() just wastes CPU cycles and
>>> random entropy.  So the original struct demotion_nodes implementation
>>> appears better.
>>> 
>>>   struct demotion_nodes {
>>>          unsigned short nr;
>>>          short nodes[DEMOTION_TARGET_NODES];
>>>   };
>>> 
>>
>>
>> Is that measurable difference? using nodemask_t makes it much easier with respect to
>> implementation. IMHO if we observe the usage of node_random() to have performance impact
>> with nodes_weight() == 1 we should fix node_random() to handle that? If you strongly
>> feel we should fix this, i can opencode node_random to special case node_weight() == 1?
>
> If there's no much difference, why not just use the existing code?
> IMHO, it's your responsibility to prove your new implementation is
> better via numbers, for example, reduced code lines, with better or same
> performance.
>
> Another policy is just to use the existing code in the first version.
> Then change it based on measurement.

One of the reasons I switched to nodemask_t is to make the code simpler.
A demotion target is essentially a node mask.

>
> In general, I care more about the most common cases, that is, 0 or 1
> demotion target.

How about I switch to the open-coded version below? That should take care
of the above concern.

>
>> -	target = node_random(&nd->preferred);
>> +	node_weight = nodes_weight(nd->preferred);
>> +	switch (node_weight) {
>> +	case 0:
>> +		target = NUMA_NO_NODE;
>> +		break;
>> +	case 1:
>> +		target = first_node(nd->preferred);
>> +		break;
>> +	default:
>> +		target = bitmap_ord_to_pos(nd->preferred.bits,
>> +					   get_random_int() % node_weight, MAX_NUMNODES);
>> +		break;
>> +	}
>>  
>>
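
The selection logic in the snippet above can be tried outside the kernel
with a small sketch. The assumptions are: an unsigned long stands in for
nodemask_t, rand() stands in for get_random_int(), NO_NODE stands in for
NUMA_NO_NODE, and mask_weight(), mask_ord_to_pos() and pick_target() are
illustrative names rather than kernel functions.

```c
#include <assert.h>
#include <stdlib.h>

#define NO_NODE (-1)	/* stand-in for NUMA_NO_NODE */

/* Number of set bits in the mask (the role nodes_weight() plays). */
static int mask_weight(unsigned long mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Position of the ord-th set bit, counting from 0 at the lowest bit. */
static int mask_ord_to_pos(unsigned long mask, int ord)
{
	int pos;

	assert(ord >= 0);
	for (pos = 0; mask; pos++, mask >>= 1)
		if ((mask & 1) && ord-- == 0)
			return pos;
	return NO_NODE;
}

/* Pick a target; only the multi-target case draws a random number. */
static int pick_target(unsigned long preferred)
{
	int weight = mask_weight(preferred);

	switch (weight) {
	case 0:
		return NO_NODE;
	case 1:
		return mask_ord_to_pos(preferred, 0);
	default:
		return mask_ord_to_pos(preferred, rand() % weight);
	}
}
```

With zero or one preferred target, which is the common case raised in the
review, no random number is consumed at all.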


* Re: [PATCH v10 3/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined
  2022-07-27  1:53       ` Huang, Ying
@ 2022-07-27  4:38         ` Aneesh Kumar K.V
  2022-07-28  6:42           ` Huang, Ying
  0 siblings, 1 reply; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-27  4:38 UTC (permalink / raw)
  To: Huang, Ying, Alistair Popple
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Dan Williams, Johannes Weiner,
	jvgediya.oss

"Huang, Ying" <ying.huang@intel.com> writes:

> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>
>> On 7/26/22 9:33 AM, Huang, Ying wrote:
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>> 
>>>> If the new NUMA node onlined doesn't have a performance level assigned,
>>>> the kernel adds the NUMA node to default memory tier.
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>> ---
>>>>  include/linux/memory-tiers.h |  1 +
>>>>  mm/memory-tiers.c            | 75 ++++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 76 insertions(+)
>>>>
>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>>> index ef380a39db3a..3d5f14d57ae6 100644
>>>> --- a/include/linux/memory-tiers.h
>>>> +++ b/include/linux/memory-tiers.h
>>>> @@ -14,6 +14,7 @@
>>>>  #define MEMTIER_PERF_LEVEL_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
>>>>  /* leave one tier below this slow pmem */
>>>>  #define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)
>>>> +#define MEMTIER_HOTPLUG_PRIO	100
>>>>  
>>>>  extern bool numa_demotion_enabled;
>>>>  
>>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>>> index 41a21cc5ae55..cc3a47ec18e4 100644
>>>> --- a/mm/memory-tiers.c
>>>> +++ b/mm/memory-tiers.c
>>>> @@ -5,6 +5,7 @@
>>>>  #include <linux/lockdep.h>
>>>>  #include <linux/moduleparam.h>
>>>>  #include <linux/node.h>
>>>> +#include <linux/memory.h>
>>>>  #include <linux/memory-tiers.h>
>>>>  
>>>>  struct memory_tier {
>>>> @@ -64,6 +65,78 @@ static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
>>>>  	return new_memtier;
>>>>  }
>>>>  
>>>> +static struct memory_tier *__node_get_memory_tier(int node)
>>>> +{
>>>> +	struct memory_tier *memtier;
>>>> +
>>>> +	list_for_each_entry(memtier, &memory_tiers, list) {
>>>> +		if (node_isset(node, memtier->nodelist))
>>>> +			return memtier;
>>>> +	}
>>>> +	return NULL;
>>>> +}
>>>> +
>>>> +static void init_node_memory_tier(int node)
>>> 
>>> set_node_memory_tier()?
>>
>> That was done based on feedback from Alistair 
>>
>> https://lore.kernel.org/linux-mm/87h73iapg1.fsf@nvdebian.thelocal
>>
>>> 
>>>> +{
>>>> +	int perf_level;
>>>> +	struct memory_tier *memtier;
>>>> +
>>>> +	mutex_lock(&memory_tier_lock);
>>>> +
>>>> +	memtier = __node_get_memory_tier(node);
>>>> +	if (!memtier) {
>>>> +		perf_level = node_devices[node]->perf_level;
>>>> +		memtier = find_create_memory_tier(perf_level);
>>>> +		node_set(node, memtier->nodelist);
>>>> +	}
>
> It's related to Alistair's comments too.  When will memtier != NULL
> here?  We may need just VM_WARN_ON() here?

When the platform driver sets the memory tier directly. With the old code,
it can happen when dax/kmem registers a node to a memory tier. With the
memory_type proposal, this can happen if the node is part of a memory
type that is already added to a memory tier.

>
>>>> +	mutex_unlock(&memory_tier_lock);
>>>> +}
>>>> +
>>>> +static void clear_node_memory_tier(int node)
>>>> +{
>>>> +	struct memory_tier *memtier;
>>>> +
>>>> +	mutex_lock(&memory_tier_lock);
>>>> +	memtier = __node_get_memory_tier(node);
>>>> +	if (memtier)
>>>> +		node_clear(node, memtier->nodelist);
>>> 
>>> When memtier->nodelist becomes empty, we need to free memtier?
>>> 
>>>> +	mutex_unlock(&memory_tier_lock);
>>>> +}
>>>> +
>>>> +/*
>>>> + * This runs whether reclaim-based migration is enabled or not,
>>>> + * which ensures that the user can turn reclaim-based migration
>>>> + * at any time without needing to recalculate migration targets.
>>>> + */
>>> 
>>> The comments doesn't apply here.
>>> 
>>>> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>>>> +						 unsigned long action, void *_arg)
>>> 
>>> Now we are building memory tiers instead of working on demotion.  So I
>>> think we should rename the function to memtier_hotplug_callback().
>>> 
>>>> +{
>>>> +	struct memory_notify *arg = _arg;
>>>> +
>>>> +	/*
>>>> +	 * Only update the node migration order when a node is
>>>> +	 * changing status, like online->offline.
>>>> +	 */
>>>> +	if (arg->status_change_nid < 0)
>>>> +		return notifier_from_errno(0);
>>>> +
>>>> +	switch (action) {
>>>> +	case MEM_OFFLINE:
>>>> +		clear_node_memory_tier(arg->status_change_nid);
>>>> +		break;
>>>> +	case MEM_ONLINE:
>>>> +		init_node_memory_tier(arg->status_change_nid);
>>>> +		break;
>>>> +	}
>>>> +
>>>> +	return notifier_from_errno(0);
>>>> +}
>>>> +
>>>> +static void __init migrate_on_reclaim_init(void)
>>>> +{
>>>> +	hotplug_memory_notifier(migrate_on_reclaim_callback, MEMTIER_HOTPLUG_PRIO);
>>>> +}
>>> 
>>> I suggest to call hotplug_memory_notifier() in memory_tier_init()
>>> directly.  We are not working on demotion here.
>>> 
>>>> +
>>>>  static int __init memory_tier_init(void)
>>>>  {
>>>>  	int node;
>>>> @@ -96,6 +169,8 @@ static int __init memory_tier_init(void)
>>>>  			node_property->perf_level = default_memtier_perf_level;
>>>>  	}
>>>>  	mutex_unlock(&memory_tier_lock);
>>>> +
>>>> +	migrate_on_reclaim_init();
>>>>  	return 0;
>>>>  }
>>>>  subsys_initcall(memory_tier_init);
>>> 
>>> Best Regards,
>>> Huang, Ying
>>
>>
>> Will update the patch in next iteration to take care of other feedback.
>
> Thanks!
>
> Best Regards,
> Huang, Ying


* Re: [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM
  2022-07-27  4:31             ` Aneesh Kumar K.V
@ 2022-07-28  6:39               ` Huang, Ying
  0 siblings, 0 replies; 39+ messages in thread
From: Huang, Ying @ 2022-07-28  6:39 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>
>>> On 7/25/22 2:05 PM, Huang, Ying wrote:
>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>> 
>>>>> On 7/25/22 12:07 PM, Huang, Ying wrote:
>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>>
>>>>>>> By default, all nodes are assigned to the default memory tier which
>>>>>>> is the memory tier designated for nodes with DRAM
>>>>>>>
>>>>>>> Set dax kmem device node's tier to slower memory tier by assigning
>>>>>>> performance level to MEMTIER_PERF_LEVEL_PMEM. PMEM tier
>>>>>>> appears below the default memory tier in demotion order.
>>>>>>>
>>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>>> ---
>>>>>>>  arch/powerpc/platforms/pseries/papr_scm.c | 41 ++++++++++++++++++++---
>>>>>>>  drivers/acpi/nfit/core.c                  | 41 ++++++++++++++++++++++-
>>>>>>>  2 files changed, 76 insertions(+), 6 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
>>>>>>> index 82cae08976bc..3b6164418d6f 100644
>>>>>>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>>>>>>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>>>>>>> @@ -14,6 +14,8 @@
>>>>>>>  #include <linux/delay.h>
>>>>>>>  #include <linux/seq_buf.h>
>>>>>>>  #include <linux/nd.h>
>>>>>>> +#include <linux/memory.h>
>>>>>>> +#include <linux/memory-tiers.h>
>>>>>>>  
>>>>>>>  #include <asm/plpar_wrappers.h>
>>>>>>>  #include <asm/papr_pdsm.h>
>>>>>>> @@ -98,6 +100,7 @@ struct papr_scm_priv {
>>>>>>>  	bool hcall_flush_required;
>>>>>>>  
>>>>>>>  	uint64_t bound_addr;
>>>>>>> +	int target_node;
>>>>>>>  
>>>>>>>  	struct nvdimm_bus_descriptor bus_desc;
>>>>>>>  	struct nvdimm_bus *bus;
>>>>>>> @@ -1278,6 +1281,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>>>>>>  	p->bus_desc.module = THIS_MODULE;
>>>>>>>  	p->bus_desc.of_node = p->pdev->dev.of_node;
>>>>>>>  	p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
>>>>>>> +	p->target_node = dev_to_node(&p->pdev->dev);
>>>>>>>  
>>>>>>>  	/* Set the dimm command family mask to accept PDSMs */
>>>>>>>  	set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask);
>>>>>>> @@ -1322,7 +1326,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>>>>>>  	mapping.size = p->blocks * p->block_size; // XXX: potential overflow?
>>>>>>>  
>>>>>>>  	memset(&ndr_desc, 0, sizeof(ndr_desc));
>>>>>>> -	target_nid = dev_to_node(&p->pdev->dev);
>>>>>>> +	target_nid = p->target_node;
>>>>>>>  	online_nid = numa_map_to_online_node(target_nid);
>>>>>>>  	ndr_desc.numa_node = online_nid;
>>>>>>>  	ndr_desc.target_node = target_nid;
>>>>>>> @@ -1582,15 +1586,42 @@ static struct platform_driver papr_scm_driver = {
>>>>>>>  	},
>>>>>>>  };
>>>>>>>  
>>>>>>> +static int papr_scm_callback(struct notifier_block *self,
>>>>>>> +			     unsigned long action, void *arg)
>>>>>>> +{
>>>>>>> +	struct memory_notify *mnb = arg;
>>>>>>> +	int nid = mnb->status_change_nid;
>>>>>>> +	struct papr_scm_priv *p;
>>>>>>> +
>>>>>>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>>>>>>> +		return NOTIFY_OK;
>>>>>>> +
>>>>>>> +	mutex_lock(&papr_ndr_lock);
>>>>>>> +	list_for_each_entry(p, &papr_nd_regions, region_list) {
>>>>>>> +		if (p->target_node == nid) {
>>>>>>> +			node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>>>>>>> +			break;
>>>>>>> +		}
>>>>>>> +	}
>>>>>>> +
>>>>>>> +	mutex_unlock(&papr_ndr_lock);
>>>>>>> +	return NOTIFY_OK;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static int __init papr_scm_init(void)
>>>>>>>  {
>>>>>>>  	int ret;
>>>>>>>  
>>>>>>>  	ret = platform_driver_register(&papr_scm_driver);
>>>>>>> -	if (!ret)
>>>>>>> -		mce_register_notifier(&mce_ue_nb);
>>>>>>> -
>>>>>>> -	return ret;
>>>>>>> +	if (ret)
>>>>>>> +		return ret;
>>>>>>> +	mce_register_notifier(&mce_ue_nb);
>>>>>>> +	/*
>>>>>>> +	 * register a memory hotplug notifier at prio 2 so that we
>>>>>>> +	 * can update the perf level for the node.
>>>>>>> +	 */
>>>>>>> +	hotplug_memory_notifier(papr_scm_callback, MEMTIER_HOTPLUG_PRIO + 1);
>>>>>>> +	return 0;
>>>>>>>  }
>>>>>>>  module_init(papr_scm_init);
>>>>>>>  
>>>>>>> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
>>>>>>> index ae5f4acf2675..7ea1017ef790 100644
>>>>>>> --- a/drivers/acpi/nfit/core.c
>>>>>>> +++ b/drivers/acpi/nfit/core.c
>>>>>>> @@ -15,6 +15,8 @@
>>>>>>>  #include <linux/sort.h>
>>>>>>>  #include <linux/io.h>
>>>>>>>  #include <linux/nd.h>
>>>>>>> +#include <linux/memory.h>
>>>>>>> +#include <linux/memory-tiers.h>
>>>>>>>  #include <asm/cacheflush.h>
>>>>>>>  #include <acpi/nfit.h>
>>>>>>>  #include "intel.h"
>>>>>>> @@ -3470,6 +3472,39 @@ static struct acpi_driver acpi_nfit_driver = {
>>>>>>>  	},
>>>>>>>  };
>>>>>>>  
>>>>>>> +static int nfit_callback(struct notifier_block *self,
>>>>>>> +			 unsigned long action, void *arg)
>>>>>>> +{
>>>>>>> +	bool found = false;
>>>>>>> +	struct memory_notify *mnb = arg;
>>>>>>> +	int nid = mnb->status_change_nid;
>>>>>>> +	struct nfit_spa *nfit_spa;
>>>>>>> +	struct acpi_nfit_desc *acpi_desc;
>>>>>>> +
>>>>>>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>>>>>>> +		return NOTIFY_OK;
>>>>>>> +
>>>>>>> +	mutex_lock(&acpi_desc_lock);
>>>>>>> +	list_for_each_entry(acpi_desc, &acpi_descs, list) {
>>>>>>> +		mutex_lock(&acpi_desc->init_mutex);
>>>>>>> +		list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
>>>>>>> +			struct acpi_nfit_system_address *spa = nfit_spa->spa;
>>>>>>> +			int target_node = pxm_to_node(spa->proximity_domain);
>>>>>>> +
>>>>>>> +			if (target_node == nid) {
>>>>>>> +				node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>>>>>>> +				found = true;
>>>>>>> +				break;
>>>>>>> +			}
>>>>>>> +		}
>>>>>>> +		mutex_unlock(&acpi_desc->init_mutex);
>>>>>>> +		if (found)
>>>>>>> +			break;
>>>>>>> +	}
>>>>>>> +	mutex_unlock(&acpi_desc_lock);
>>>>>>> +	return NOTIFY_OK;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static __init int nfit_init(void)
>>>>>>>  {
>>>>>>>  	int ret;
>>>>>>> @@ -3509,7 +3544,11 @@ static __init int nfit_init(void)
>>>>>>>  		nfit_mce_unregister();
>>>>>>>  		destroy_workqueue(nfit_wq);
>>>>>>>  	}
>>>>>>> -
>>>>>>> +	/*
>>>>>>> +	 * register a memory hotplug notifier at prio 2 so that we
>>>>>>> +	 * can update the perf level for the node.
>>>>>>> +	 */
>>>>>>> +	hotplug_memory_notifier(nfit_callback, MEMTIER_HOTPLUG_PRIO + 1);
>>>>>>>  	return ret;
>>>>>>>  
>>>>>>>  }
>>>>>>
>>>>>> I don't think that it's a good idea to set perf_level of a memory device
>>>>>> (node) via NFIT only.
>>>>>
>>>>>>
>>>>>> For example, we may prefer HMAT over NFIT when it's available.  So the
>>>>>> perf_level should be set in dax/kmem.c based on information provided by
>>>>>> ACPI or other information sources.  ACPI can provide some functions/data
>>>>>> structures to let drivers (like dax/kmem.c) to query the properties of
>>>>>> the memory device (node).
>>>>>>
>>>>>
>>>>> I was trying to make it architecture specific so that we have a placeholder
>>>>> to fine-tune this better. For example, ppc64 will look at device tree
>>>>> details to find the performance level and x86 will look at ACPI data structure.
>>>>> Adding that hotplug callback in dax/kmem will prevent that architecture-specific
>>>>> customization? 
>>>>>
>>>>> I would expect that callback to move to the generic ACPI layer so that even
>>>>> firmware managed CXL devices can be added to a lower tier?  I don't understand
>>>>> ACPI enough to find the right abstraction for that hotplug callback. 
>>>> 
>>>> I'm OK for this to be architecture specific.
>>>> 
>>>> But ACPI NFIT isn't enough for x86.  For example, PMEM can be added to a
>>>> virtual machine as normal memory nodes without NFIT.  Instead, PMEM is
>>>> marked via "memmap=<nn>G!<ss>G" or "efi_fake_mem=<nn>G@<ss>G:0x40000",
>>>> and dax/kmem.c is used to hot-add the memory.
>>>> 
>>>> So, before a more sophisticated version is implemented for x86.  The
>>>> simplest version as I suggested below works even better.
>>>> 
>>>>>> As the simplest first version, this can be just hard coded.
>>>>>>
>>>>>
>>>>> If you are suggesting to not use hotplug callback, one of the challenge was node_devices[nid]
>>>>> get allocated pretty late when we try to online the node. 
>>>> 
>>>> As the simplest first version, this can be as simple as,
>>>> 
>>>> /* dax/kmem.c */
>>>> static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>> {
>>>> 	node_devices[dev_dax->target_node]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>>>> 	/* add_memory_driver_managed() */
>>>> }
>>>> 
>>>> To be compatible with ppc64 version, how about make dev_dax_kmem_probe()
>>>> set perf_level only if it's uninitialized?
>>>
>>> That will result in kernel crash because node_devices[dev_dax->target_node] is not initialized there. 
>>>
>>> it get allocated in add_memory_resource -> __try_online_node ->
>>> register_one_node -> __register_one_node -> node_devices[nid] =
>>> kzalloc(sizeof(struct node), GFP_KERNEL);
>>
>> Ah, right!  So we need some other way to do that, for example, a global
>> array as follows,
>>
>>   int node_perf_levels[MAX_NUMNODES];
>
> This would be much simpler than adding memory_type, but then it is a
> memory device property and hence it will be a good idea to group
> it together with other properties in node_devices[]. We could look at
> allocating node_devices[] for dax/kmem nodes from within the driver?

We have other choices too.  For example, we have the per-node memory tier
related data structure node_demotion[].  And we have NODE_DATA().  But I
think we can try memory_type first.

> I did implement memory_type, and it does bring some additional complexity,
> though it simplifies the interface.

Thanks!  Let's look at it.  It's something we need to try sooner or later.

> I was looking at the platform drivers calling
> struct memory_dev_type *register_memory_type(int perf_level, int node)
> to register a new memory_type. if dax/kmem don't find a memory_dev_type
> registered for the NUMA node it will assign default pmem type.
>
> 	if (!node_memory_types[numa_node]) {
> 		/*
> 		 * low level drivers didn't initialize the memory type.
> 		 * assign a default level.
> 		 */
> 		node_memory_types[numa_node] = &default_pmem_type;
> 		node_set(numa_node, default_pmem_type.nodes);
> 	}
>
> This should allow ACPI or papr_scm to fine-tune the memory type of
> the device they are initializing.

I guess that you are trying to coordinate multiple drivers that may
manage the same memory devices?  For example, ACPI NFIT, ACPI HMAT,
dax/kmem may "manage" a PMEM device.  So it's possible for any driver to
set its memory_type and even change it?

To simplify the situation, I suggest that only the driver which will
online the memory node will set the memory_type for the node.  In this
way, we will never change the memory_type of a memory node.  And we will
not change the memory_tier of a memory node during boot.  The driver
which onlines the memory node (e.g., dax/kmem.c) may query ACPI
NFIT/ACPI HMAT or papr_scm to get more information.

We can use a special driver to manage memory nodes onlined during boot.

Because memory_type is per driver, memory devices that have the same
performance but are managed by different drivers can be put in different
memory_types.  So we don't need default_pmem_type.  This is different
from memory_tier, where memory devices with the same performance need
to be put in one memory_tier.

>>
>> And, I think that we need to consider the memory type here too.  As
>> suggested by Johannes, memory type describes a set of memory devices
>> (NUMA nodes) with same performance character (that is, abstract distance
>> or perf level).  The data structure can be something as follows,
>>
>>   struct memory_type {
>>         int perf_level;
>>         struct list_head tier_sibling;
>>         nodemask_t nodes;
>>   };
>
> memory type is already used in include/linux/memremap.h
>
> enum memory_type 
>
> How about struct memory_dev_type? 

Sounds good to me.

> How about we also add memtier here that is only accessed with
> memory_tier_lock held? That will allow easy access to the memory
> tier this type belongs

Who will use it?  If we have no users now, we can add it when there are some.

>>
>> The memory_type should be created and managed by the device drivers
>> (e.g., dax/kmem) which manages the memory devices.  In the future, we
>> will represent it in sysfs, and a per-memory_type knob will be provided
>> to offset the perf_level of all memory devices managed by the
>> memory_type.
>>
>> With memory_type, the memory_tier becomes,
>>
>>   struct memory_tier {
>>         int perf_level_start;
>>         struct list_head sibling;
>>         struct list_head memory_types;
>>   };
>>
>> And we need an array to map from nid to memory_type, e.g., as follows,
>>
>>   struct memory_type *node_memory_types[MAX_NUMNODES];
>
> Changing the perf level of a memory devices will move it from one
> memory type to the other and such a move should will also results
> in updating node's memory tier. ie, it will be something like below

I think that we should only change the perf level of a memory_type (and
thereby of all of its memory devices), but never change the perf level of
an individual memory device.  Per my understanding, this was suggested by
Johannes too.

And we don't need to change the perf level now either.  That needs to be
done via a per-memory-type user space knob.
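
A minimal sketch of this per-type model, under stated assumptions: nodes
carry no perf level of their own and read the one of their memory type, so
a single adjustment moves all member nodes together. The bitmask, MAX_NODES,
and the helpers add_node(), node_perf_level() and memory_type_adjust() are
hypothetical and only illustrate the idea; they are not part of the series.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8	/* stand-in for MAX_NUMNODES */

struct memory_dev_type {
	int perf_level;		/* shared by every node of this type */
	unsigned long nodes;	/* bitmask of member nodes */
};

static struct memory_dev_type *node_memory_types[MAX_NODES];

/* Attach a node to a type; done once by the driver that onlines it. */
static void add_node(struct memory_dev_type *type, int node)
{
	assert(node >= 0 && node < MAX_NODES);
	type->nodes |= 1UL << node;
	node_memory_types[node] = type;
}

/* A node has no perf level of its own; it reads the one of its type. */
static int node_perf_level(int node)
{
	assert(node >= 0 && node < MAX_NODES);
	assert(node_memory_types[node] != NULL);
	return node_memory_types[node]->perf_level;
}

/* The knob: offset the perf level of a whole type at once. */
static void memory_type_adjust(struct memory_dev_type *type, int offset)
{
	type->perf_level += offset;
}
```

Because the node-to-type mapping never changes in this model, no per-node
move between types (and no per-node tier update) is needed.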

> static void update_node_perf_level(int node, struct memory_dev_type *memtype)
> {
> 	pg_data_t *pgdat;
> 	struct memory_dev_type *current_memtype;
> 	struct memory_tier *memtier;
>
> 	pgdat = NODE_DATA(node);
> 	if (!pgdat)
> 		return;
>
> 	mutex_lock(&memory_tier_lock);
> 	/*
> 	 * Make sure we mark the memtier NULL before we assign the new memory tier
> 	 * to the NUMA node. This make sure that anybody looking at NODE_DATA
> 	 * finds a NULL memtier or the one which is still valid.
> 	 */
> 	rcu_assign_pointer(pgdat->memtier, NULL);
> 	synchronize_rcu();
>
> 	if (!memtype->memtier) {
> 		memtier = find_create_memory_tier(memtype);
> 		if (IS_ERR(memtier))
> 			goto err_out;
> 	}
> 	current_memtype = node_memory_types[node];
> 	node_clear(node, current_memtype->nodes);
> 	/*
> 	 * If current memtype becomes empty, we should kill the memory tiers
> 	 */
> 	node_set(node, memtype->nodes);
>         node_memory_types[node] = memtype;
> 	rcu_assign_pointer(pgdat->memtier, memtype->memtier);
> 	establish_migration_targets();
> err_out:
> 	mutex_unlock(&memory_tier_lock);
> }
>
>
>>
>> We need to manage the memory_type in device drivers, instead of ACPI or
>> device tree callbacks.
>>
>> Because memory_type is an important part of the explicit memory tier
>> implementation and may influence the design, I suggest to include it in
>> the implementation now.  It appears not too complex anyway.
>>
>
> One thing I observed while implementing this is the additional
> complexity while walking the memory tiers. Any tier related operation
> impacting memory numa nodes now becomes a two linked list walk as below.
>
> ex:
> list_for_each_entry(memtier, &memory_tiers, list) {
> 	list_for_each_entry(memtype, &memtier->memory_types, tier_sibiling)
> 		nodes_or(nodes, nodes, memtype->nodes);
>
> As we offline N_MEMORY nodes, we now have to do
>
> 	memtype = node_memory_types[node];
>         if (nodes_empty(memtype->nodes)) {
>         	list_del(&memtype->tier_sibiling);
>                         if (list_empty(&current_memtier->memory_types))
>                         	destroy_memory_tier(current_memtier);
>

Yes.  This may increase code complexity.  Let's check the resulting code.

Best Regards,
Huang, Ying


* Re: [PATCH v10 3/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined
  2022-07-27  4:38         ` Aneesh Kumar K.V
@ 2022-07-28  6:42           ` Huang, Ying
  0 siblings, 0 replies; 39+ messages in thread
From: Huang, Ying @ 2022-07-28  6:42 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Alistair Popple, linux-mm, akpm, Wei Xu, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Dan Williams, Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>
>>> On 7/26/22 9:33 AM, Huang, Ying wrote:
>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

[snip]

>>>>>  
>>>>> +static struct memory_tier *__node_get_memory_tier(int node)
>>>>> +{
>>>>> +	struct memory_tier *memtier;
>>>>> +
>>>>> +	list_for_each_entry(memtier, &memory_tiers, list) {
>>>>> +		if (node_isset(node, memtier->nodelist))
>>>>> +			return memtier;
>>>>> +	}
>>>>> +	return NULL;
>>>>> +}
>>>>> +
>>>>> +static void init_node_memory_tier(int node)
>>>> 
>>>> set_node_memory_tier()?
>>>
>>> That was done based on feedback from Alistair 
>>>
>>> https://lore.kernel.org/linux-mm/87h73iapg1.fsf@nvdebian.thelocal
>>>
>>>> 
>>>>> +{
>>>>> +	int perf_level;
>>>>> +	struct memory_tier *memtier;
>>>>> +
>>>>> +	mutex_lock(&memory_tier_lock);
>>>>> +
>>>>> +	memtier = __node_get_memory_tier(node);
>>>>> +	if (!memtier) {
>>>>> +		perf_level = node_devices[node]->perf_level;
>>>>> +		memtier = find_create_memory_tier(perf_level);
>>>>> +		node_set(node, memtier->nodelist);
>>>>> +	}
>>
>> It's related to Alistair's comments too.  When will memtier != NULL
>> here?  We may need just VM_WARN_ON() here?
>
> When the platform driver sets memory tier directly. With the old code
> it can happen when dax/kmem register a node to a memory tier. With
> memory_type proposal this can happen if the node is part of memory
> type that is already added to a memory tier. 

Let's look at what it looks like with memory_type in place.

Best Regards,
Huang, Ying

>>
>>>>> +	mutex_unlock(&memory_tier_lock);
>>>>> +}
>>>>> +

[snip]


* Re: [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-27  4:35         ` Aneesh Kumar K.V
@ 2022-07-28  6:51           ` Huang, Ying
  0 siblings, 0 replies; 39+ messages in thread
From: Huang, Ying @ 2022-07-28  6:51 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>
>>> On 7/26/22 1:14 PM, Huang, Ying wrote:
>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>> 
>
> ....
>
>>>> + */
>>>>> +int next_demotion_node(int node)
>>>>> +{
>>>>> +	struct demotion_nodes *nd;
>>>>> +	int target;
>>>>> +
>>>>> +	if (!node_demotion)
>>>>> +		return NUMA_NO_NODE;
>>>>> +
>>>>> +	nd = &node_demotion[node];
>>>>> +
>>>>> +	/*
>>>>> +	 * node_demotion[] is updated without excluding this
>>>>> +	 * function from running.
>>>>> +	 *
>>>>> +	 * Make sure to use RCU over entire code blocks if
>>>>> +	 * node_demotion[] reads need to be consistent.
>>>>> +	 */
>>>>> +	rcu_read_lock();
>>>>> +	/*
>>>>> +	 * If there are multiple target nodes, just select one
>>>>> +	 * target node randomly.
>>>>> +	 *
>>>>> +	 * In addition, we can also use round-robin to select
>>>>> +	 * target node, but we should introduce another variable
>>>>> +	 * for node_demotion[] to record last selected target node,
>>>>> +	 * that may cause cache ping-pong due to the changing of
>>>>> +	 * last target node. Or introducing per-cpu data to avoid
>>>>> +	 * caching issue, which seems more complicated. So selecting
>>>>> +	 * target node randomly seems better until now.
>>>>> +	 */
>>>>> +	target = node_random(&nd->preferred);
>>>> 
>>>> In one of the most common cases, nodes_weight(&nd->preferred) == 1.
>>>> Where, get_random_int() in node_random() just wastes CPU cycles and
>>>> random entropy.  So the original struct demotion_nodes implementation
>>>> appears better.
>>>> 
>>>>   struct demotion_nodes {
>>>>          unsigned short nr;
>>>>          short nodes[DEMOTION_TARGET_NODES];
>>>>   };
>>>> 
>>>
>>>
>>> Is that measurable difference? using nodemask_t makes it much easier with respect to
>>> implementation. IMHO if we observe the usage of node_random() to have performance impact
>>> with nodes_weight() == 1 we should fix node_random() to handle that? If you strongly
>>> feel we should fix this, i can opencode node_random to special case node_weight() == 1?
>>
>> If there's no much difference, why not just use the existing code?
>> IMHO, it's your responsibility to prove your new implementation is
>> better via numbers, for example, reduced code lines, with better or same
>> performance.
>>
>> Another policy is just to use the existing code in the first version.
>> Then change it based on measurement.
>
> One of the reason I switched to nodemask_t is to make code simpler.
> demotion target is essentially a node mask. 
>
>>
>> In general, I care more about the most common cases, that is, 0 or 1
>> demotion target.
>
> How about I switch to the below opencoded version. That should take care
> of the above concern. 

By my estimation, the performance for 0 or 1 demotion target should be
OK.

And I think that you can change the node_random() implementation
directly, because it will not hurt other users either.

Best Regards,
Huang, Ying

>>
>>> -	target = node_random(&nd->preferred);
>>> +	node_weight = nodes_weight(nd->preferred);
>>> +	switch (node_weight) {
>>> +	case 0:
>>> +		target = NUMA_NO_NODE;
>>> +		break;
>>> +	case 1:
>>> +		target = first_node(nd->preferred);
>>> +		break;
>>> +	default:
>>> +		target = bitmap_ord_to_pos(nd->preferred.bits,
>>> +					   get_random_int() % node_weight, MAX_NUMNODES);
>>> +		break;
>>> +	}
>>>  
>>>
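
The special-casing proposed in the quoted diff can be modelled outside the
kernel. What follows is a minimal userspace sketch under stated assumptions,
not the kernel implementation: a uint64_t stands in for nodemask_t, rand()
stands in for get_random_int(), and mask_weight(), mask_ord_to_pos(),
pick_demotion_target() and NO_NODE are hypothetical stand-ins for
nodes_weight(), bitmap_ord_to_pos(), the open-coded selection and
NUMA_NO_NODE.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NO_NODE (-1)	/* hypothetical stand-in for NUMA_NO_NODE */

/* Count the set bits in the mask (stand-in for nodes_weight()). */
static int mask_weight(uint64_t mask)
{
	int w = 0;

	for (; mask; mask &= mask - 1)	/* clear the lowest set bit */
		w++;
	return w;
}

/* Return the position of the ord-th set bit (0-based), or NO_NODE. */
static int mask_ord_to_pos(uint64_t mask, int ord)
{
	assert(ord >= 0);
	for (int pos = 0; pos < 64; pos++) {
		if (!(mask & (UINT64_C(1) << pos)))
			continue;
		if (ord-- == 0)
			return pos;
	}
	return NO_NODE;
}

/* Pick a target; the random source is consulted only for 2+ candidates. */
static int pick_demotion_target(uint64_t preferred)
{
	int w = mask_weight(preferred);

	switch (w) {
	case 0:
		return NO_NODE;
	case 1:
		return mask_ord_to_pos(preferred, 0);
	default:
		return mask_ord_to_pos(preferred, rand() % w);
	}
}
```

With an empty mask the function returns NO_NODE, and with a single set bit it
returns that bit's position, in both cases without calling rand(). Those are
the 0 and 1 demotion target cases discussed above.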


* Re: [PATCH v10 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-27  1:16       ` Huang, Ying
@ 2022-07-28 17:23         ` Johannes Weiner
  0 siblings, 0 replies; 39+ messages in thread
From: Johannes Weiner @ 2022-07-28 17:23 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Aneesh Kumar K V, Wei Xu, linux-mm, akpm, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, jvgediya.oss,
	Jagdish Gediya

On Wed, Jul 27, 2022 at 09:16:08AM +0800, Huang, Ying wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> > It is an abstract concept that indicates the performance of the
> > device. As we learn more about which device attribute makes more impact in
> > defining hierarchy, performance level will give more weightage to that specific
> > attribute. It could be write latency or bandwidth. For me, distance has a direct
> > linkage to latency because that is how we define numa distance now. Adding
> > abstract to the name is not making it more abstract than perf_level. 
> >
> > I am open to suggestions from others.  Wei Xu has also suggested perf_level name.
> > I can rename this to abstract_distance if that indicates the goal better.
> 
> I'm open to naming.  But I think that it's good to define it at some
> degree instead of completely opaque stuff.  If it's latency based, then
> low value corresponds to high performance.  If it's bandwidth based,
> then low value corresponds to low performance.
> 
> Hi, Wei and Johannes,
> 
> What do you think about this?

I'm also partial to distance. It's a familiar metric in non-uniform
memory for guiding placement decisions, and that is how we continue to
use it here too.

It's historically meant bus latency, but given how the kernel
perceives and acts on the metric IMO the term works just fine to
express differences in bandwidth and chip response times as well.
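
To make the direction of a distance-style metric concrete, here is a small
userspace sketch. It assumes only what is said above: a lower value means
higher performance. The struct, the function name, the sample values and the
"nearest slower node" policy are all hypothetical and are not taken from the
patch series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-node record: lower distance means faster memory. */
struct mem_node {
	int id;
	int distance;
};

/* Made-up example data, for illustration only. */
static const struct mem_node sample_nodes[] = {
	{ 0, 10 },	/* e.g. fast memory */
	{ 1, 40 },	/* e.g. slow memory */
	{ 2, 20 },	/* e.g. something in between */
};

/*
 * Return the id of the node whose distance is the smallest one that is
 * still strictly larger than src_distance, or -1 if no slower node exists.
 */
static int next_slower_node(const struct mem_node *nodes, size_t n,
			    int src_distance)
{
	int best = -1;
	int best_distance = 0;

	assert(nodes != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int d = nodes[i].distance;

		if (d <= src_distance)
			continue;	/* not slower than the source */
		if (best < 0 || d < best_distance) {
			best = nodes[i].id;
			best_distance = d;
		}
	}
	return best;
}
```

The comparison reads the same whether the number was derived from latency,
bandwidth or a mix of both, provided it is normalized so that smaller means
faster.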


* Re: [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-27  1:40       ` Huang, Ying
  2022-07-27  4:35         ` Aneesh Kumar K.V
@ 2022-08-03  3:18         ` Aneesh Kumar K.V
  2022-08-04  4:19           ` Huang, Ying
  1 sibling, 1 reply; 39+ messages in thread
From: Aneesh Kumar K.V @ 2022-08-03  3:18 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Huang, Ying" <ying.huang@intel.com> writes:

> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>
>> On 7/26/22 1:14 PM, Huang, Ying wrote:
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

....

>>> +
>>>>  static void init_node_memory_tier(int node)
>>>>  {
>>>>  	int perf_level;
>>>> @@ -84,11 +285,19 @@ static void init_node_memory_tier(int node)
>>>>  	mutex_lock(&memory_tier_lock);
>>>>  
>>>>  	memtier = __node_get_memory_tier(node);
>>>> +	/*
>>>> +	 * if node is already part of the tier proceed with the
>>>> +	 * current tier value, because we might want to establish
>>>> +	 * new migration paths now. The node might be added to a tier
>>>> +	 * before it was made part of N_MEMORY, hence estabilish_migration_targets
>>>> +	 * will have skipped this node.
>>>> +	 */
>>>>  	if (!memtier) {
>>>>  		perf_level = node_devices[node]->perf_level;
>>>>  		memtier = find_create_memory_tier(perf_level);
>>>>  		node_set(node, memtier->nodelist);
>>>>  	}
>>>> +	establish_migration_targets();
>>> 
>>> Why combines memory tiers establishing with demotion targets building?
>>> I think that it's better to separate them.   For example, if we move a
>>> set of NUMA node from one memory tier to another memory tier, we only
>>> need to run establish_migration_targets() once after moving all nodes.
>>> 
>>
>> Yes agree. I am not sure I followed your comment here. 
>>
>> Demotion target rebuilding is a separate helper. Any update to memory tiers needs rebuilding
>> of demotion targets. Also any change in node mask of memory tier needs
>> demotion target rebuild. Can you clarify the code change you are suggesting here?
>
> I think we should call establish_migration_targets() in
> migrate_on_reclaim_callback() directly.  As the example I mentioned
> above, sometimes, we don't need to call establish_migration_targets()
> for each node changing.
>

We need to hold memory_tier_lock while updating a node's memory tier and
rebuilding demotion targets. All of that is done in the same function
here. A helper that updates the memory tier of multiple nodes together
would do what you mentioned above under memory_tier_lock, i.e., update
the memory tier of all the nodes and then call
establish_migration_targets() once.

-aneesh


* Re: [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-08-03  3:18         ` Aneesh Kumar K.V
@ 2022-08-04  4:19           ` Huang, Ying
  0 siblings, 0 replies; 39+ messages in thread
From: Huang, Ying @ 2022-08-04  4:19 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>
>>> On 7/26/22 1:14 PM, Huang, Ying wrote:
>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
> ....
>
>>>> +
>>>>>  static void init_node_memory_tier(int node)
>>>>>  {
>>>>>  	int perf_level;
>>>>> @@ -84,11 +285,19 @@ static void init_node_memory_tier(int node)
>>>>>  	mutex_lock(&memory_tier_lock);
>>>>>  
>>>>>  	memtier = __node_get_memory_tier(node);
>>>>> +	/*
>>>>> +	 * if node is already part of the tier proceed with the
>>>>> +	 * current tier value, because we might want to establish
>>>>> +	 * new migration paths now. The node might be added to a tier
>>>>> +	 * before it was made part of N_MEMORY, hence estabilish_migration_targets
>>>>> +	 * will have skipped this node.
>>>>> +	 */
>>>>>  	if (!memtier) {
>>>>>  		perf_level = node_devices[node]->perf_level;
>>>>>  		memtier = find_create_memory_tier(perf_level);
>>>>>  		node_set(node, memtier->nodelist);
>>>>>  	}
>>>>> +	establish_migration_targets();
>>>> 
>>>> Why combines memory tiers establishing with demotion targets building?
>>>> I think that it's better to separate them.   For example, if we move a
>>>> set of NUMA node from one memory tier to another memory tier, we only
>>>> need to run establish_migration_targets() once after moving all nodes.
>>>> 
>>>
>>> Yes agree. I am not sure I followed your comment here. 
>>>
>>> Demotion target rebuilding is a separate helper. Any update to memory tiers needs rebuilding
>>> of demotion targets. Also any change in node mask of memory tier needs
>>> demotion target rebuild. Can you clarify the code change you are suggesting here?
>>
>> I think we should call establish_migration_targets() in
>> migrate_on_reclaim_callback() directly.  As the example I mentioned
>> above, sometimes, we don't need to call establish_migration_targets()
>> for each node changing.
>>
>
> We need to hold memory_tier_lock while updating node's memory tier and
> rebuilding demotion targets. All of that is done in the same function
> here. An update node memory tier that allow for updating multiple node's
> memory tier together would do what you mentioned above under
> memory_tier_lock ie, update all the nodes memory tier and then call
> establish_migration_targets() once.

I don't think it's good to duplicate code unnecessarily.  Managing
memory tiers and establishing demotion targets are two separate things.
We shouldn't combine them.  If memory_tier_lock needs to be held, just
enclose establish_migration_targets() with it.
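
One way to picture the separation is the userspace sketch below. It is not
the patch code. The lock is modelled by a plain flag so that the "caller
must hold the lock" rule can be checked with assert(), and
set_node_tier_locked(), rebuild_targets_locked(), move_nodes_to_tier() and
the tier_lock()/tier_unlock() helpers are hypothetical names. Only
memory_tier_lock and establish_migration_targets() come from the series.

```c
#include <assert.h>

#define MAX_NODES 8

/* Stand-in for memory_tier_lock: a flag instead of a real mutex. */
static int tier_lock_held;
static int node_tier[MAX_NODES];	/* tier index per node */
static int rebuild_count;		/* number of target rebuilds */

/* Made-up example input, for illustration only. */
static const int example_nodes[] = { 1, 2, 3 };

static void tier_lock(void)
{
	assert(!tier_lock_held);
	tier_lock_held = 1;
}

static void tier_unlock(void)
{
	assert(tier_lock_held);
	tier_lock_held = 0;
}

/* Tier management only; it never rebuilds demotion targets. */
static void set_node_tier_locked(int node, int tier)
{
	assert(tier_lock_held);
	assert(node >= 0 && node < MAX_NODES);
	node_tier[node] = tier;
}

/* Stand-in for establish_migration_targets(); caller holds the lock. */
static void rebuild_targets_locked(void)
{
	assert(tier_lock_held);
	rebuild_count++;
}

/* The caller encloses both steps with the lock and rebuilds once. */
static void move_nodes_to_tier(const int *nodes, int n, int tier)
{
	tier_lock();
	for (int i = 0; i < n; i++)
		set_node_tier_locked(nodes[i], tier);
	rebuild_targets_locked();
	tier_unlock();
}
```

Calling move_nodes_to_tier(example_nodes, 3, 2) moves three nodes and
performs a single rebuild, and the tier helper stays usable by any caller
that wants to decide for itself when the rebuild happens.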

Best Regards,
Huang, Ying



end of thread, other threads:[~2022-08-04  4:19 UTC | newest]

Thread overview: 39+ messages
2022-07-20  2:59 [PATCH v10 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
2022-07-20  2:59 ` [PATCH v10 1/8] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
2022-07-26  3:53   ` Huang, Ying
2022-07-26 11:59     ` Aneesh Kumar K V
2022-07-27  1:16       ` Huang, Ying
2022-07-28 17:23         ` Johannes Weiner
2022-07-20  2:59 ` [PATCH v10 2/8] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
2022-07-20  2:59 ` [PATCH v10 3/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined Aneesh Kumar K.V
2022-07-26  4:03   ` Huang, Ying
2022-07-26 12:03     ` Aneesh Kumar K V
2022-07-27  1:53       ` Huang, Ying
2022-07-27  4:38         ` Aneesh Kumar K.V
2022-07-28  6:42           ` Huang, Ying
2022-07-20  2:59 ` [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM Aneesh Kumar K.V
2022-07-21  6:07   ` kernel test robot
2022-07-25  6:37   ` Huang, Ying
2022-07-25  6:48     ` Aneesh Kumar K V
2022-07-25  8:35       ` Huang, Ying
2022-07-25  8:42         ` Aneesh Kumar K V
2022-07-26  2:13           ` Huang, Ying
2022-07-27  4:31             ` Aneesh Kumar K.V
2022-07-28  6:39               ` Huang, Ying
2022-07-20  2:59 ` [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
2022-07-20  3:38   ` Aneesh Kumar K.V
2022-07-21  0:02   ` kernel test robot
2022-07-26  7:44   ` Huang, Ying
2022-07-26 12:30     ` Aneesh Kumar K V
2022-07-27  1:40       ` Huang, Ying
2022-07-27  4:35         ` Aneesh Kumar K.V
2022-07-28  6:51           ` Huang, Ying
2022-08-03  3:18         ` Aneesh Kumar K.V
2022-08-04  4:19           ` Huang, Ying
2022-07-20  2:59 ` [PATCH v10 6/8] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
2022-07-26  8:02   ` Huang, Ying
2022-07-20  2:59 ` [PATCH v10 7/8] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
2022-07-26  8:24   ` Huang, Ying
2022-07-20  2:59 ` [PATCH v10 8/8] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
2022-07-25  8:54   ` Huang, Ying
2022-07-25  8:56     ` Aneesh Kumar K V
