* [PATCH v11 0/8] mm/demotion: Memory tiers and demotion
From: Aneesh Kumar K.V @ 2022-07-28 19:04 UTC
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

The current kernel has basic memory tiering support: Inactive pages on a
higher tier NUMA node can be migrated (demoted) to a lower tier NUMA node to
make room for new allocations on the higher tier NUMA node. Frequently accessed
pages on a lower tier NUMA node can be migrated (promoted) to a higher tier NUMA
node to improve performance.

In the current kernel, memory tiers are defined implicitly via a demotion path
relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed. The
current implementation puts all nodes with CPU into the top tier, and builds the
tier hierarchy tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.

The current memory tier kernel interface needs to be improved for several
important use cases:

* The current tier initialization code always initializes each memory-only NUMA
  node into a lower tier. But a memory-only NUMA node may have a high
  performance memory device (e.g. a DRAM device attached via CXL.mem or a
  DRAM-backed memory-only node on a virtual machine) and should be put into a
  higher tier.

* The current tier hierarchy always puts CPU nodes into the top tier. But on a
  system with HBM (e.g. GPU memory) devices, these memory-only HBM NUMA nodes
  should be in the top tier, and DRAM nodes with CPUs are better placed in the
  next lower tier.

* Also, because the current tier hierarchy always puts CPU nodes into the top
  tier, when a CPU is hot-added (or hot-removed) and turns a memory node from a
  CPU-less node into a CPU node (or vice versa), the memory tier hierarchy gets
  changed, even though no memory node is added or removed. This can make the
  tier hierarchy unstable and make it difficult to support tier-based memory
  accounting.

* A higher tier node can only be demoted to selected nodes on the next lower
  tier as defined by the demotion path, not any other node from any lower tier.
  This strict, hard-coded demotion order does not work in all use cases (e.g.
  some use cases may want to allow cross-socket demotion to another node in the
  same demotion tier as a fallback when the preferred demotion node is out of
  space), and has resulted in the feature request for an interface to override
  the system-wide, per-node demotion order from userspace. This demotion
  order is also inconsistent with the page allocation fallback order when all
  the nodes in a higher tier are out of space: The page allocation can fall back
  to any node from any lower tier, whereas the demotion order doesn't allow
  that.

This patch series makes the creation of memory tiers explicit, under the
control of device drivers.

Memory Tier Initialization
==========================

The Linux kernel presents memory devices as NUMA nodes, and each memory device
is of a specific type. The memory type of a device is represented by its
abstract distance. A memory tier corresponds to a range of abstract distances.
This allows memory devices within a specific performance range to be classified
into a single memory tier.

By default, all memory nodes are assigned to the default tier with
abstract distance 512.

A device driver can move its memory nodes out of the default tier. For
example, a PMEM driver can move its memory nodes below the default tier,
whereas a GPU driver can move its memory nodes above it.

The kernel initialization code makes the decision on which exact tier a memory
node should be assigned to based on the requests from the device drivers as well
as the memory device hardware information provided by the firmware.

Hot-adding/removing CPUs doesn't affect the memory tier hierarchy.
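
As an illustration, a driver for a slower memory device can follow the same
pattern the dax/kmem driver uses later in this series: define a
memory_dev_type with a suitable abstract distance and register it for its
NUMA node before the memory is onlined. A minimal sketch against the
interface added in patch 4 (MY_DEVICE_ADISTANCE is a made-up name; in this
series, any value below the default 512 places the node in a lower, i.e.
slower, tier):

  #include <linux/memory-tiers.h>

  /* illustration only: reuse the PMEM distance for a hypothetical device */
  #define MY_DEVICE_ADISTANCE	MEMTIER_ADISTANCE_PMEM

  static struct memory_dev_type slow_mem_type = {
  	.adistance = MY_DEVICE_ADISTANCE,
  	.tier_sibiling = LIST_HEAD_INIT(slow_mem_type.tier_sibiling),
  	.nodes = NODE_MASK_NONE,
  };

  static void my_driver_probe_node(int numa_node)
  {
  	/* assign the type before the node's memory is onlined */
  	init_node_memory_type(numa_node, &slow_mem_type);
  }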

Changes from v10:
* rename performance level to abstract distance
* Thanks to all the good feedback from Huang, Ying <ying.huang@intel.com>.
  Updated the patchset to cover most of the review feedback.

Changes from v9:
* Use performance level for initializing memory tiers.

Changes from v8:
* Drop the sysfs interface patches and related documentation changes.

Changes from v7:
* Fix kernel crash with demotion.
* Improve documentation.

Changes from v6:
* Drop the usage of rank.
* Address other review feedback.

Changes from v5:
* Remove the patch supporting N_MEMORY node removal from memory tiers. Memory
  tiers are going to be used for features other than demotion, hence keep all
  N_MEMORY nodes in memory tiers irrespective of whether they participate in
  promotion or demotion.
* Add NODE_DATA->memtier
* Rearrange patches to add sysfs files later.
* Add support to create memory tiers from userspace.
* Address other review feedback.


Changes from v4:
* Address review feedback.
* Reverse the meaning of "rank": higher rank value means higher tier.
* Add "/sys/devices/system/memtier/default_tier".
* Add node_is_toptier

v4:
Add support for explicit memory tiers and ranks.

v3:
- Modify patch 1 subject to make it more specific
- Remove /sys/kernel/mm/numa/demotion_targets interface, use
  /sys/devices/system/node/demotion_targets instead and make
  it writable to override node_states[N_DEMOTION_TARGETS].
- Add support to view per node demotion targets via sysfs

v2:
In v1, only 1st patch of this patch series was sent, which was
implemented to avoid some of the limitations on the demotion
target sharing, however for certain numa topology, the demotion
targets found by that patch was not most optimal, so 1st patch
in this series is modified according to suggestions from Huang
and Baolin. Different examples of demotion list comparasion
between existing implementation and changed implementation can
be found in the commit message of 1st patch.



Aneesh Kumar K.V (7):
  mm/demotion: Add support for explicit memory tiers
  mm/demotion: Move memory demotion related code
  mm/demotion: Add hotplug callbacks to handle new numa node onlined
  mm/demotion/dax/kmem: Set node's abstract distance to
    MEMTIER_ADISTANCE_PMEM
  mm/demotion: Build demotion targets based on explicit memory tiers
  mm/demotion: Add pg_data_t member to track node memory tier details
  mm/demotion: Update node_is_toptier to work with memory tiers

Jagdish Gediya (1):
  mm/demotion: Demote pages according to allocation fallback order

 drivers/dax/kmem.c           |   9 +
 include/linux/memory-tiers.h |  79 +++++
 include/linux/migrate.h      |  15 -
 include/linux/mmzone.h       |   3 +
 include/linux/node.h         |   5 -
 mm/Makefile                  |   1 +
 mm/huge_memory.c             |   1 +
 mm/memory-tiers.c            | 586 +++++++++++++++++++++++++++++++++++
 mm/migrate.c                 | 453 +--------------------------
 mm/mprotect.c                |   1 +
 mm/vmscan.c                  |  59 +++-
 mm/vmstat.c                  |   4 -
 12 files changed, 725 insertions(+), 491 deletions(-)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

-- 
2.37.1



* [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers
From: Aneesh Kumar K.V @ 2022-07-28 19:04 UTC
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V, Jagdish Gediya

In the current kernel, memory tiers are defined implicitly via a demotion path
relationship between NUMA nodes, which is created during the kernel
initialization and updated when a NUMA node is hot-added or hot-removed. The
current implementation puts all nodes with CPU into the highest tier, and builds
the tier hierarchy tier-by-tier by establishing the per-node demotion targets
based on the distances between nodes.

The current memory tier kernel implementation needs to be improved for several
important use cases:

The current tier initialization code always initializes each memory-only NUMA
node into a lower tier. But a memory-only NUMA node may have a high performance
memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top tier. But on a
system with HBM or GPU devices, the memory-only NUMA nodes mapping these
devices should be in the top tier, and DRAM nodes with CPUs are better placed
in the next lower tier.

With the current kernel, a higher tier node can only be demoted to nodes with
the shortest distance on the next lower tier, as defined by the demotion path,
not to any other node from any lower tier. This strict demotion order does not
work in all use cases (e.g. some use cases may want to allow cross-socket
demotion to another node in the same demotion tier as a fallback when the
preferred demotion node is out of space). This demotion order is also
inconsistent with the page allocation fallback order when all the nodes in a
higher tier are out of space: The page allocation can fall back to any node
from any lower tier, whereas the demotion order doesn't allow that.

This patch series addresses the above by defining memory tiers explicitly.

Linux kernel presents memory devices as NUMA nodes and each memory device is of
a specific type. The memory type of a device is represented by its abstract
distance. A memory tier corresponds to a range of abstract distance. This allows
for classifying memory devices with a specific performance range into a memory
tier.

This patch configures the range/chunk size to be 128. The default DRAM
abstract distance is 512. We can have 4 memory tiers below the default DRAM
abstract distance, covering the ranges 0 - 127, 128 - 255, 256 - 383 and
384 - 511. Slower memory devices like persistent memory will have abstract
distances below the default DRAM level and hence will be placed in these 4
lower tiers.

A kernel parameter is provided to override the default memory tier.
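
To make the mapping concrete, here is a small userspace sketch (illustration
only) of the arithmetic used when a memory type is placed into a tier: the
abstract distance is rounded down to a 128-wide chunk boundary to find the
start of its tier:

  #include <stdio.h>

  #define MEMTIER_CHUNK_BITS	7
  #define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)	/* 128 */

  /* mirrors round_down(adistance, MEMTIER_CHUNK_SIZE) */
  static int tier_start(int adistance)
  {
  	return adistance & ~(MEMTIER_CHUNK_SIZE - 1);
  }

  int main(void)
  {
  	/* the two abstract distances defined in this patch */
  	printf("DRAM (512): tier %d..%d\n", tier_start(512), tier_start(512) + 127);
  	printf("PMEM (128): tier %d..%d\n", tier_start(128), tier_start(128) + 127);
  	return 0;
  }

This prints "DRAM (512): tier 512..639" and "PMEM (128): tier 128..255",
matching the chunk layout described above.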

Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  17 ++++++
 mm/Makefile                  |   1 +
 mm/memory-tiers.c            | 102 +++++++++++++++++++++++++++++++++++
 3 files changed, 120 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..8d7884b7a3f0
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+/*
+ * Each tier covers an abstract distance chunk of size 128
+ */
+#define MEMTIER_CHUNK_BITS	7
+#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
+/*
+ * For now let's have 4 memory tiers below the default DRAM tier.
+ */
+#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
+/* leave one tier below this slow pmem */
+#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
+
+#endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..d30acebc2164 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..01cfd514c192
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/lockdep.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+	/* hierarchy of memory tiers */
+	struct list_head list;
+	/* list of all memory types part of this tier */
+	struct list_head memory_types;
+	/*
+	 * start value of abstract distance. A memory tier maps
+	 * an abstract distance range,
+	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
+	 */
+	int adistance_start;
+};
+
+struct memory_dev_type {
+	/* list of memory types that are part of the same tier as this type */
+	struct list_head tier_sibiling;
+	/* abstract distance for this specific memory type */
+	int adistance;
+	/* Nodes of same abstract distance */
+	nodemask_t nodes;
+	struct memory_tier *memtier;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+struct memory_dev_type *node_memory_types[MAX_NUMNODES];
+/*
+ * For now let's have 4 memory tiers below the default DRAM tier.
+ */
+static struct memory_dev_type default_dram_type  = {
+	.adistance = MEMTIER_ADISTANCE_DRAM,
+	.tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
+};
+
+static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
+{
+	bool found_slot = false;
+	struct memory_tier *memtier, *new_memtier;
+	int adistance = memtype->adistance;
+	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	/*
+	 * If the memtype is already part of a memory tier,
+	 * just return that.
+	 */
+	if (memtype->memtier)
+		return memtype->memtier;
+
+	adistance = round_down(adistance, memtier_adistance_chunk_size);
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (adistance == memtier->adistance_start) {
+			memtype->memtier = memtier;
+			list_add(&memtype->tier_sibiling, &memtier->memory_types);
+			return memtier;
+		} else if (adistance < memtier->adistance_start) {
+			found_slot = true;
+			break;
+		}
+	}
+
+	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!new_memtier)
+		return ERR_PTR(-ENOMEM);
+
+	new_memtier->adistance_start = adistance;
+	INIT_LIST_HEAD(&new_memtier->list);
+	INIT_LIST_HEAD(&new_memtier->memory_types);
+	if (found_slot)
+		list_add_tail(&new_memtier->list, &memtier->list);
+	else
+		list_add_tail(&new_memtier->list, &memory_tiers);
+	memtype->memtier = new_memtier;
+	list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
+	return new_memtier;
+}
+
+static int __init memory_tier_init(void)
+{
+	struct memory_tier *memtier;
+
+	mutex_lock(&memory_tier_lock);
+	/* CPU only nodes are not part of memory tiers. */
+	default_dram_type.nodes = node_states[N_MEMORY];
+
+	memtier = find_create_memory_tier(&default_dram_type);
+	if (IS_ERR(memtier))
+		panic("%s() failed to register memory tier: %ld\n",
+		      __func__, PTR_ERR(memtier));
+	mutex_unlock(&memory_tier_lock);
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);
-- 
2.37.1



* [PATCH v11 2/8] mm/demotion: Move memory demotion related code
From: Aneesh Kumar K.V @ 2022-07-28 19:04 UTC
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

This moves the memory demotion related code to mm/memory-tiers.c.
No functional change in this patch.
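
The moved code includes the numa_demotion_enabled knob and its sysfs
attribute. The knob keeps its existing location: the "numa" kobject is
created under mm_kobj, so demotion can still be toggled by writing "true"
or "false" to /sys/kernel/mm/numa/demotion_enabled.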

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  8 +++++
 include/linux/migrate.h      |  2 --
 mm/memory-tiers.c            | 62 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 | 60 +---------------------------------
 mm/vmscan.c                  |  1 +
 5 files changed, 72 insertions(+), 61 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 8d7884b7a3f0..b85901c0caba 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -14,4 +14,12 @@
 /* leave one tier below this slow pmem */
 #define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
 
+#ifdef CONFIG_NUMA
+#include <linux/types.h>
+extern bool numa_demotion_enabled;
+
+#else
+
+#define numa_demotion_enabled	false
+#endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 069a89e847f3..43e737215f33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -78,7 +78,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
 extern void set_migration_target_nodes(void);
 extern void migrate_on_reclaim_init(void);
-extern bool numa_demotion_enabled;
 extern int next_demotion_node(int node);
 #else
 static inline void set_migration_target_nodes(void) {}
@@ -87,7 +86,6 @@ static inline int next_demotion_node(int node)
 {
         return NUMA_NO_NODE;
 }
-#define numa_demotion_enabled  false
 #endif
 
 #ifdef CONFIG_COMPACTION
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 01cfd514c192..03e43f3dc942 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -100,3 +100,65 @@ static int __init memory_tier_init(void)
 	return 0;
 }
 subsys_initcall(memory_tier_init);
+
+bool numa_demotion_enabled = false;
+
+#ifdef CONFIG_MIGRATION
+#ifdef CONFIG_SYSFS
+static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  numa_demotion_enabled ? "true" : "false");
+}
+
+static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = kstrtobool(buf, &numa_demotion_enabled);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute numa_demotion_enabled_attr =
+	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
+	       numa_demotion_enabled_store);
+
+static struct attribute *numa_attrs[] = {
+	&numa_demotion_enabled_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group numa_attr_group = {
+	.attrs = numa_attrs,
+};
+
+static int __init numa_init_sysfs(void)
+{
+	int err;
+	struct kobject *numa_kobj;
+
+	numa_kobj = kobject_create_and_add("numa", mm_kobj);
+	if (!numa_kobj) {
+		pr_err("failed to create numa kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(numa_kobj, &numa_attr_group);
+	if (err) {
+		pr_err("failed to register numa group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(numa_kobj);
+	return err;
+}
+subsys_initcall(numa_init_sysfs);
+#endif /* CONFIG_SYSFS */
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c1ea61f39d8..fce7d4a9e940 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2509,64 +2509,6 @@ void __init migrate_on_reclaim_init(void)
 	set_migration_target_nodes();
 	cpus_read_unlock();
 }
+#endif /* CONFIG_NUMA */
 
-bool numa_demotion_enabled = false;
-
-#ifdef CONFIG_SYSFS
-static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
-					  struct kobj_attribute *attr, char *buf)
-{
-	return sysfs_emit(buf, "%s\n",
-			  numa_demotion_enabled ? "true" : "false");
-}
-
-static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
-					   struct kobj_attribute *attr,
-					   const char *buf, size_t count)
-{
-	ssize_t ret;
-
-	ret = kstrtobool(buf, &numa_demotion_enabled);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static struct kobj_attribute numa_demotion_enabled_attr =
-	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
-	       numa_demotion_enabled_store);
-
-static struct attribute *numa_attrs[] = {
-	&numa_demotion_enabled_attr.attr,
-	NULL,
-};
-
-static const struct attribute_group numa_attr_group = {
-	.attrs = numa_attrs,
-};
-
-static int __init numa_init_sysfs(void)
-{
-	int err;
-	struct kobject *numa_kobj;
 
-	numa_kobj = kobject_create_and_add("numa", mm_kobj);
-	if (!numa_kobj) {
-		pr_err("failed to create numa kobject\n");
-		return -ENOMEM;
-	}
-	err = sysfs_create_group(numa_kobj, &numa_attr_group);
-	if (err) {
-		pr_err("failed to register numa group\n");
-		goto delete_obj;
-	}
-	return 0;
-
-delete_obj:
-	kobject_put(numa_kobj);
-	return err;
-}
-subsys_initcall(numa_init_sysfs);
-#endif /* CONFIG_SYSFS */
-#endif /* CONFIG_NUMA */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7d9a683e3a7..3a8f78277f99 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,7 @@
 #include <linux/printk.h>
 #include <linux/dax.h>
 #include <linux/psi.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
-- 
2.37.1



* [PATCH v11 3/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined
From: Aneesh Kumar K.V @ 2022-07-28 19:04 UTC
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

If a newly onlined NUMA node doesn't have an abstract distance assigned,
the kernel adds the NUMA node to the default memory tier.
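
For example, when memory on a node is onlined before any driver has
registered a memory type for it, the MEM_ONLINE callback below assigns the
node to default_dram_type and places it in the default tier; on MEM_OFFLINE
the node is cleared from its type, and a tier whose last memory type goes
away is destroyed.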

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  1 +
 mm/memory-tiers.c            | 83 ++++++++++++++++++++++++++++++++++++
 2 files changed, 84 insertions(+)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index b85901c0caba..976f43a5e3be 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -13,6 +13,7 @@
 #define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
 /* leave one tier below this slow pmem */
 #define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
+#define MEMTIER_HOTPLUG_PRIO	100
 
 #ifdef CONFIG_NUMA
 #include <linux/types.h>
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 03e43f3dc942..c9854a394d9b 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -3,6 +3,7 @@
 #include <linux/nodemask.h>
 #include <linux/slab.h>
 #include <linux/lockdep.h>
+#include <linux/memory.h>
 #include <linux/memory-tiers.h>
 
 struct memory_tier {
@@ -83,6 +84,87 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
 	return new_memtier;
 }
 
+static struct memory_tier *__node_get_memory_tier(int node)
+{
+	struct memory_dev_type *memtype;
+
+	memtype = node_memory_types[node];
+	if (memtype)
+		return memtype->memtier;
+	return NULL;
+}
+
+static void init_node_memory_tier(int node)
+{
+	struct memory_tier *memtier;
+
+	mutex_lock(&memory_tier_lock);
+
+	memtier = __node_get_memory_tier(node);
+	if (!memtier) {
+		struct memory_dev_type *memtype;
+
+		if (!node_memory_types[node]) {
+			node_memory_types[node] = &default_dram_type;
+			node_set(node, default_dram_type.nodes);
+		}
+		memtype = node_memory_types[node];
+		memtier = find_create_memory_tier(memtype);
+	}
+	mutex_unlock(&memory_tier_lock);
+}
+
+static void destroy_memory_tier(struct memory_tier *memtier)
+{
+	list_del(&memtier->list);
+	kfree(memtier);
+}
+
+static void clear_node_memory_tier(int node)
+{
+	struct memory_tier *current_memtier;
+
+	mutex_lock(&memory_tier_lock);
+	current_memtier = __node_get_memory_tier(node);
+	if (current_memtier) {
+		struct memory_dev_type *memtype;
+
+		memtype = node_memory_types[node];
+		node_clear(node, memtype->nodes);
+		if (nodes_empty(memtype->nodes)) {
+			list_del(&memtype->tier_sibiling);
+			memtype->memtier = NULL;
+			if (list_empty(&current_memtier->memory_types))
+				destroy_memory_tier(current_memtier);
+		}
+	}
+	mutex_unlock(&memory_tier_lock);
+}
+
+static int __meminit memtier_hotplug_callback(struct notifier_block *self,
+					      unsigned long action, void *_arg)
+{
+	struct memory_notify *arg = _arg;
+
+	/*
+	 * Only update the node migration order when a node is
+	 * changing status, like online->offline.
+	 */
+	if (arg->status_change_nid < 0)
+		return notifier_from_errno(0);
+
+	switch (action) {
+	case MEM_OFFLINE:
+		clear_node_memory_tier(arg->status_change_nid);
+		break;
+	case MEM_ONLINE:
+		init_node_memory_tier(arg->status_change_nid);
+		break;
+	}
+
+	return notifier_from_errno(0);
+}
+
 static int __init memory_tier_init(void)
 {
 	struct memory_tier *memtier;
@@ -97,6 +179,7 @@ static int __init memory_tier_init(void)
 		      __func__, PTR_ERR(memtier));
 	mutex_unlock(&memory_tier_lock);
 
+	hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRIO);
 	return 0;
 }
 subsys_initcall(memory_tier_init);
-- 
2.37.1



* [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
From: Aneesh Kumar K.V @ 2022-07-28 19:04 UTC
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

By default, all nodes are assigned to the default memory tier, which is
the memory tier designated for nodes with DRAM.

Set the dax kmem device node's tier to a slower memory tier by assigning its
abstract distance to MEMTIER_ADISTANCE_PMEM. The PMEM tier appears below the
default memory tier in the demotion order.
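
With the constants from patch 1, MEMTIER_ADISTANCE_PMEM is 128, so these
nodes land in the tier covering abstract distances 128 - 255, below the
default DRAM tier that starts at 512, while one still-slower tier (0 - 127)
remains available below PMEM.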

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/dax/kmem.c           |  9 +++++++++
 include/linux/memory-tiers.h | 19 ++++++++++++++++++-
 mm/memory-tiers.c            | 28 ++++++++++++++++------------
 3 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..6b0d5de9a3e9 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -11,6 +11,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/memory-tiers.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -41,6 +42,12 @@ struct dax_kmem_data {
 	struct resource *res[];
 };
 
+static struct memory_dev_type default_pmem_type  = {
+	.adistance = MEMTIER_ADISTANCE_PMEM,
+	.tier_sibiling = LIST_HEAD_INIT(default_pmem_type.tier_sibiling),
+	.nodes  = NODE_MASK_NONE,
+};
+
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
@@ -62,6 +69,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		return -EINVAL;
 	}
 
+	init_node_memory_type(numa_node, &default_pmem_type);
+
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct range range;
 
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 976f43a5e3be..4f4baf0bf430 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_MEMORY_TIERS_H
 #define _LINUX_MEMORY_TIERS_H
 
+#include <linux/types.h>
+#include <linux/nodemask.h>
 /*
  * Each tier covers an abstract distance chunk of size 128
  */
@@ -15,12 +17,27 @@
 #define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
 #define MEMTIER_HOTPLUG_PRIO	100
 
+struct memory_tier;
+struct memory_dev_type {
+	/* list of memory types that are part of the same tier as this type */
+	struct list_head tier_sibiling;
+	/* abstract distance for this specific memory type */
+	int adistance;
+	/* Nodes of same abstract distance */
+	nodemask_t nodes;
+	struct memory_tier *memtier;
+};
+
 #ifdef CONFIG_NUMA
-#include <linux/types.h>
 extern bool numa_demotion_enabled;
+struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *default_type);
 
 #else
 
 #define numa_demotion_enabled	false
+static inline struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *default_type)
+{
+	return ERR_PTR(-EINVAL);
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index c9854a394d9b..109be75fa554 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -1,6 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0
-#include <linux/types.h>
-#include <linux/nodemask.h>
 #include <linux/slab.h>
 #include <linux/lockdep.h>
 #include <linux/memory.h>
@@ -19,16 +17,6 @@ struct memory_tier {
 	int adistance_start;
 };
 
-struct memory_dev_type {
-	/* list of memory types that are part of the same tier as this type */
-	struct list_head tier_sibiling;
-	/* abstract distance for this specific memory type */
-	int adistance;
-	/* Nodes of same abstract distance */
-	nodemask_t nodes;
-	struct memory_tier *memtier;
-};
-
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
 struct memory_dev_type *node_memory_types[MAX_NUMNODES];
@@ -141,6 +129,22 @@ static void clear_node_memory_tier(int node)
 	mutex_unlock(&memory_tier_lock);
 }
 
+struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *default_type)
+{
+	struct memory_dev_type *mem_type;
+
+	mutex_lock(&memory_tier_lock);
+	if (node_memory_types[node]) {
+		mem_type = node_memory_types[node];
+	} else {
+		node_memory_types[node] = default_type;
+		node_set(node, default_type->nodes);
+		mem_type = default_type;
+	}
+	mutex_unlock(&memory_tier_lock);
+	return mem_type;
+}
+
 static int __meminit memtier_hotplug_callback(struct notifier_block *self,
 					      unsigned long action, void *_arg)
 {
-- 
2.37.1



* [PATCH v11 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
From: Aneesh Kumar K.V @ 2022-07-28 19:04 UTC
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

This patch switches the demotion target building logic to use memory tiers
instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
default memory tier, and additional memory tiers will be added by drivers like
dax kmem.

This patch builds the demotion target for a NUMA node by looking at all
memory tiers below the tier to which the NUMA node belongs. The closest node
in the immediately following memory tier is used as a demotion target.

Since we now only build demotion targets for N_MEMORY NUMA nodes, the CPU
hotplug callbacks are removed in this patch.
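
When several nodes in the next lower tier are at the same best distance,
they all go into a per-node "preferred" mask and one of them is picked at
random on each demotion via node_random(). A minimal userspace sketch of
that selection policy (illustration only; pick_random() stands in for
node_random()):

  #include <stdio.h>
  #include <stdlib.h>

  /* pick a random set bit from a small "preferred" bitmask */
  static int pick_random(unsigned long preferred)
  {
  	int bits[64], n = 0;

  	for (int i = 0; i < 64; i++)
  		if (preferred & (1UL << i))
  			bits[n++] = i;
  	return n ? bits[rand() % n] : -1;	/* -1 plays NUMA_NO_NODE */
  }

  int main(void)
  {
  	/* a node with one target {2}, and one with two targets {2,3} */
  	printf("-> %d\n", pick_random(1UL << 2));
  	printf("-> %d\n", pick_random((1UL << 2) | (1UL << 3)));
  	return 0;
  }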

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  13 ++
 include/linux/migrate.h      |  13 --
 mm/memory-tiers.c            | 221 +++++++++++++++++++-
 mm/migrate.c                 | 394 -----------------------------------
 mm/vmstat.c                  |   4 -
 5 files changed, 233 insertions(+), 412 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 4f4baf0bf430..e56a57c6ef78 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -31,6 +31,14 @@ struct memory_dev_type {
 #ifdef CONFIG_NUMA
 extern bool numa_demotion_enabled;
 struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *default_type);
+#ifdef CONFIG_MIGRATION
+int next_demotion_node(int node);
+#else
+static inline int next_demotion_node(int node)
+{
+	return NUMA_NO_NODE;
+}
+#endif
 
 #else
 
@@ -39,5 +47,10 @@ static inline struct memory_dev_type *init_node_memory_type(int node, struct mem
 {
 	return ERR_PTR(-EINVAL);
 }
+
+static inline int next_demotion_node(int node)
+{
+	return NUMA_NO_NODE;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 43e737215f33..93fab62e6548 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 
 #endif /* CONFIG_MIGRATION */
 
-#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
-extern void set_migration_target_nodes(void);
-extern void migrate_on_reclaim_init(void);
-extern int next_demotion_node(int node);
-#else
-static inline void set_migration_target_nodes(void) {}
-static inline void migrate_on_reclaim_init(void) {}
-static inline int next_demotion_node(int node)
-{
-        return NUMA_NO_NODE;
-}
-#endif
-
 #ifdef CONFIG_COMPACTION
 extern int PageMovable(struct page *page);
 extern void __SetPageMovable(struct page *page, struct address_space *mapping);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 109be75fa554..60845aa74afc 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -2,8 +2,11 @@
 #include <linux/slab.h>
 #include <linux/lockdep.h>
 #include <linux/memory.h>
+#include <linux/random.h>
 #include <linux/memory-tiers.h>
 
+#include "internal.h"
+
 struct memory_tier {
 	/* hierarchy of memory tiers */
 	struct list_head list;
@@ -17,9 +20,74 @@ struct memory_tier {
 	int adistance_start;
 };
 
+struct demotion_nodes {
+	nodemask_t preferred;
+};
+
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
 struct memory_dev_type *node_memory_types[MAX_NUMNODES];
+#ifdef CONFIG_MIGRATION
+/*
+ * node_demotion[] examples:
+ *
+ * Example 1:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
+ *
+ * node distances:
+ * node   0    1    2    3
+ *    0  10   20   30   40
+ *    1  20   10   40   30
+ *    2  30   40   10   40
+ *    3  40   30   40   10
+ *
+ * memory_tiers0 = 0-1
+ * memory_tiers1 = 2-3
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 3
+ * node_demotion[2].preferred = <empty>
+ * node_demotion[3].preferred = <empty>
+ *
+ * Example 2:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   30
+ *    2  30   30   10
+ *
+ * memory_tiers0 = 0-2
+ *
+ * node_demotion[0].preferred = <empty>
+ * node_demotion[1].preferred = <empty>
+ * node_demotion[2].preferred = <empty>
+ *
+ * Example 3:
+ *
+ * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   40
+ *    2  30   40   10
+ *
+ * memory_tiers0 = 1
+ * memory_tiers1 = 0
+ * memory_tiers2 = 2
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 0
+ * node_demotion[2].preferred = <empty>
+ *
+ */
+static struct demotion_nodes *node_demotion __read_mostly;
+#endif /* CONFIG_MIGRATION */
+
 /*
  * For now let's have 4 memory tiers below the default DRAM tier.
  */
@@ -82,6 +150,144 @@ static struct memory_tier *__node_get_memory_tier(int node)
 	return NULL;
 }
 
+#ifdef CONFIG_MIGRATION
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * Return: node id for next memory node in the demotion path hierarchy
+ * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
+ * @node online or guarantee that it *continues* to be the next demotion
+ * target.
+ */
+int next_demotion_node(int node)
+{
+	struct demotion_nodes *nd;
+	int target;
+
+	if (!node_demotion)
+		return NUMA_NO_NODE;
+
+	nd = &node_demotion[node];
+
+	/*
+	 * node_demotion[] is updated without excluding this
+	 * function from running.
+	 *
+	 * Make sure to use RCU over entire code blocks if
+	 * node_demotion[] reads need to be consistent.
+	 */
+	rcu_read_lock();
+	/*
+	 * If there are multiple target nodes, just select one
+	 * target node randomly.
+	 *
+	 * In addition, we can also use round-robin to select
+	 * target node, but we should introduce another variable
+	 * for node_demotion[] to record last selected target node,
+	 * that may cause cache ping-pong due to the changing of
+	 * last target node. Or introducing per-cpu data to avoid
+	 * caching issue, which seems more complicated. So selecting
+	 * target node randomly seems better until now.
+	 */
+	target = node_random(&nd->preferred);
+	rcu_read_unlock();
+
+	return target;
+}
+
+static void disable_all_demotion_targets(void)
+{
+	int node;
+
+	for_each_node_state(node, N_MEMORY)
+		node_demotion[node].preferred = NODE_MASK_NONE;
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 */
+	synchronize_rcu();
+}
+
+static __always_inline nodemask_t get_memtier_nodemask(struct memory_tier *memtier)
+{
+	nodemask_t nodes = NODE_MASK_NONE;
+	struct memory_dev_type *memtype;
+
+	list_for_each_entry(memtype, &memtier->memory_types, tier_sibiling)
+		nodes_or(nodes, nodes, memtype->nodes);
+
+	return nodes;
+}
+
+/*
+ * Find an automatic demotion target for all memory
+ * nodes. Failing here is OK.  It might just indicate
+ * being at the end of a chain.
+ */
+static void establish_demotion_targets(void)
+{
+	struct memory_tier *memtier;
+	struct demotion_nodes *nd;
+	int target = NUMA_NO_NODE, node;
+	int distance, best_distance;
+	nodemask_t tier_nodes;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
+		return;
+
+	disable_all_demotion_targets();
+
+	for_each_node_state(node, N_MEMORY) {
+		best_distance = -1;
+		nd = &node_demotion[node];
+
+		memtier = __node_get_memory_tier(node);
+		if (!memtier || list_is_first(&memtier->list, &memory_tiers))
+			continue;
+		/*
+		 * Get the lower memtier to find the demotion node list.
+		 */
+		memtier = list_prev_entry(memtier, list);
+		tier_nodes = get_memtier_nodemask(memtier);
+		/*
+		 * find_next_best_node, use 'used' nodemask as a skip list.
+		 * Add all memory nodes except the selected memory tier
+		 * nodelist to skip list so that we find the best node from the
+		 * memtier nodelist.
+		 */
+		nodes_andnot(tier_nodes, node_states[N_MEMORY], tier_nodes);
+
+		/*
+		 * Find all the nodes in the memory tier node list of same best distance.
+		 * add them to the preferred mask. We randomly select between nodes
+		 * in the preferred mask when allocating pages during demotion.
+		 */
+		do {
+			target = find_next_best_node(node, &tier_nodes);
+			if (target == NUMA_NO_NODE)
+				break;
+
+			distance = node_distance(node, target);
+			if (distance == best_distance || best_distance == -1) {
+				best_distance = distance;
+				node_set(target, nd->preferred);
+			} else {
+				break;
+			}
+		} while (1);
+	}
+}
+
+#else
+static inline void disable_all_demotion_targets(void) {}
+static inline void establish_demotion_targets(void) {}
+#endif /* CONFIG_MIGRATION */
+
 static void init_node_memory_tier(int node)
 {
 	struct memory_tier *memtier;
@@ -89,6 +295,13 @@ static void init_node_memory_tier(int node)
 	mutex_lock(&memory_tier_lock);
 
 	memtier = __node_get_memory_tier(node);
+	/*
+	 * If the node is already part of a tier, proceed with the
+	 * current tier value, because we might want to establish
+	 * new migration paths now. The node might have been added to
+	 * a tier before it was made part of N_MEMORY, in which case
+	 * establish_demotion_targets will have skipped this node.
+	 */
 	if (!memtier) {
 		struct memory_dev_type *memtype;
 
@@ -99,6 +312,7 @@ static void init_node_memory_tier(int node)
 		memtype = node_memory_types[node];
 		memtier = find_create_memory_tier(memtype);
 	}
+	establish_demotion_targets();
 	mutex_unlock(&memory_tier_lock);
 }
 
@@ -125,6 +339,7 @@ static void clear_node_memory_tier(int node)
 			if (list_empty(&current_memtier->memory_types))
 				destroy_memory_tier(current_memtier);
 		}
+		establish_demotion_targets();
 	}
 	mutex_unlock(&memory_tier_lock);
 }
@@ -182,7 +397,11 @@ static int __init memory_tier_init(void)
 		panic("%s() failed to register memory tier: %ld\n",
 		      __func__, PTR_ERR(memtier));
 	mutex_unlock(&memory_tier_lock);
-
+#ifdef CONFIG_MIGRATION
+	node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
+				GFP_KERNEL);
+	WARN_ON(!node_demotion);
+#endif
 	hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRIO);
 	return 0;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index fce7d4a9e940..c758c9c21d7d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2117,398 +2117,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	return 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
-
-/*
- * node_demotion[] example:
- *
- * Consider a system with two sockets.  Each socket has
- * three classes of memory attached: fast, medium and slow.
- * Each memory class is placed in its own NUMA node.  The
- * CPUs are placed in the node with the "fast" memory.  The
- * 6 NUMA nodes (0-5) might be split among the sockets like
- * this:
- *
- *	Socket A: 0, 1, 2
- *	Socket B: 3, 4, 5
- *
- * When Node 0 fills up, its memory should be migrated to
- * Node 1.  When Node 1 fills up, it should be migrated to
- * Node 2.  The migration path start on the nodes with the
- * processors (since allocations default to this node) and
- * fast memory, progress through medium and end with the
- * slow memory:
- *
- *	0 -> 1 -> 2 -> stop
- *	3 -> 4 -> 5 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *
- *	{  nr=1, nodes[0]=1 }, // Node 0 migrates to 1
- *	{  nr=1, nodes[0]=2 }, // Node 1 migrates to 2
- *	{  nr=0, nodes[0]=-1 }, // Node 2 does not migrate
- *	{  nr=1, nodes[0]=4 }, // Node 3 migrates to 4
- *	{  nr=1, nodes[0]=5 }, // Node 4 migrates to 5
- *	{  nr=0, nodes[0]=-1 }, // Node 5 does not migrate
- *
- * Moreover some systems may have multiple slow memory nodes.
- * Suppose a system has one socket with 3 memory nodes, node 0
- * is fast memory type, and node 1/2 both are slow memory
- * type, and the distance between fast memory node and slow
- * memory node is same. So the migration path should be:
- *
- *	0 -> 1/2 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
- *	{ nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
- *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
- */
-
-/*
- * Writes to this array occur without locking.  Cycles are
- * not allowed: Node X demotes to Y which demotes to X...
- *
- * If multiple reads are performed, a single rcu_read_lock()
- * must be held over all reads to ensure that no cycles are
- * observed.
- */
-#define DEFAULT_DEMOTION_TARGET_NODES 15
-
-#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
-#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
-#else
-#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
-#endif
-
-struct demotion_nodes {
-	unsigned short nr;
-	short nodes[DEMOTION_TARGET_NODES];
-};
-
-static struct demotion_nodes *node_demotion __read_mostly;
-
-/**
- * next_demotion_node() - Get the next node in the demotion path
- * @node: The starting node to lookup the next node
- *
- * Return: node id for next memory node in the demotion path hierarchy
- * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
- * @node online or guarantee that it *continues* to be the next demotion
- * target.
- */
-int next_demotion_node(int node)
-{
-	struct demotion_nodes *nd;
-	unsigned short target_nr, index;
-	int target;
-
-	if (!node_demotion)
-		return NUMA_NO_NODE;
-
-	nd = &node_demotion[node];
-
-	/*
-	 * node_demotion[] is updated without excluding this
-	 * function from running.  RCU doesn't provide any
-	 * compiler barriers, so the READ_ONCE() is required
-	 * to avoid compiler reordering or read merging.
-	 *
-	 * Make sure to use RCU over entire code blocks if
-	 * node_demotion[] reads need to be consistent.
-	 */
-	rcu_read_lock();
-	target_nr = READ_ONCE(nd->nr);
-
-	switch (target_nr) {
-	case 0:
-		target = NUMA_NO_NODE;
-		goto out;
-	case 1:
-		index = 0;
-		break;
-	default:
-		/*
-		 * If there are multiple target nodes, just select one
-		 * target node randomly.
-		 *
-		 * In addition, we can also use round-robin to select
-		 * target node, but we should introduce another variable
-		 * for node_demotion[] to record last selected target node,
-		 * that may cause cache ping-pong due to the changing of
-		 * last target node. Or introducing per-cpu data to avoid
-		 * caching issue, which seems more complicated. So selecting
-		 * target node randomly seems better until now.
-		 */
-		index = get_random_int() % target_nr;
-		break;
-	}
-
-	target = READ_ONCE(nd->nodes[index]);
-
-out:
-	rcu_read_unlock();
-	return target;
-}
-
-/* Disable reclaim-based migration. */
-static void __disable_all_migrate_targets(void)
-{
-	int node, i;
-
-	if (!node_demotion)
-		return;
-
-	for_each_online_node(node) {
-		node_demotion[node].nr = 0;
-		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
-			node_demotion[node].nodes[i] = NUMA_NO_NODE;
-	}
-}
-
-static void disable_all_migrate_targets(void)
-{
-	__disable_all_migrate_targets();
-
-	/*
-	 * Ensure that the "disable" is visible across the system.
-	 * Readers will see either a combination of before+disable
-	 * state or disable+after.  They will never see before and
-	 * after state together.
-	 *
-	 * The before+after state together might have cycles and
-	 * could cause readers to do things like loop until this
-	 * function finishes.  This ensures they can only see a
-	 * single "bad" read and would, for instance, only loop
-	 * once.
-	 */
-	synchronize_rcu();
-}
-
-/*
- * Find an automatic demotion target for 'node'.
- * Failing here is OK.  It might just indicate
- * being at the end of a chain.
- */
-static int establish_migrate_target(int node, nodemask_t *used,
-				    int best_distance)
-{
-	int migration_target, index, val;
-	struct demotion_nodes *nd;
-
-	if (!node_demotion)
-		return NUMA_NO_NODE;
-
-	nd = &node_demotion[node];
-
-	migration_target = find_next_best_node(node, used);
-	if (migration_target == NUMA_NO_NODE)
-		return NUMA_NO_NODE;
-
-	/*
-	 * If the node has been set a migration target node before,
-	 * which means it's the best distance between them. Still
-	 * check if this node can be demoted to other target nodes
-	 * if they have a same best distance.
-	 */
-	if (best_distance != -1) {
-		val = node_distance(node, migration_target);
-		if (val > best_distance)
-			goto out_clear;
-	}
-
-	index = nd->nr;
-	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
-		      "Exceeds maximum demotion target nodes\n"))
-		goto out_clear;
-
-	nd->nodes[index] = migration_target;
-	nd->nr++;
-
-	return migration_target;
-out_clear:
-	node_clear(migration_target, *used);
-	return NUMA_NO_NODE;
-}
-
-/*
- * When memory fills up on a node, memory contents can be
- * automatically migrated to another node instead of
- * discarded at reclaim.
- *
- * Establish a "migration path" which will start at nodes
- * with CPUs and will follow the priorities used to build the
- * page allocator zonelists.
- *
- * The difference here is that cycles must be avoided.  If
- * node0 migrates to node1, then neither node1, nor anything
- * node1 migrates to can migrate to node0. Also one node can
- * be migrated to multiple nodes if the target nodes all have
- * a same best-distance against the source node.
- *
- * This function can run simultaneously with readers of
- * node_demotion[].  However, it can not run simultaneously
- * with itself.  Exclusion is provided by memory hotplug events
- * being single-threaded.
- */
-static void __set_migration_target_nodes(void)
-{
-	nodemask_t next_pass;
-	nodemask_t this_pass;
-	nodemask_t used_targets = NODE_MASK_NONE;
-	int node, best_distance;
-
-	/*
-	 * Avoid any oddities like cycles that could occur
-	 * from changes in the topology.  This will leave
-	 * a momentary gap when migration is disabled.
-	 */
-	disable_all_migrate_targets();
-
-	/*
-	 * Allocations go close to CPUs, first.  Assume that
-	 * the migration path starts at the nodes with CPUs.
-	 */
-	next_pass = node_states[N_CPU];
-again:
-	this_pass = next_pass;
-	next_pass = NODE_MASK_NONE;
-	/*
-	 * To avoid cycles in the migration "graph", ensure
-	 * that migration sources are not future targets by
-	 * setting them in 'used_targets'.  Do this only
-	 * once per pass so that multiple source nodes can
-	 * share a target node.
-	 *
-	 * 'used_targets' will become unavailable in future
-	 * passes.  This limits some opportunities for
-	 * multiple source nodes to share a destination.
-	 */
-	nodes_or(used_targets, used_targets, this_pass);
-
-	for_each_node_mask(node, this_pass) {
-		best_distance = -1;
-
-		/*
-		 * Try to set up the migration path for the node, and the target
-		 * migration nodes can be multiple, so doing a loop to find all
-		 * the target nodes if they all have a best node distance.
-		 */
-		do {
-			int target_node =
-				establish_migrate_target(node, &used_targets,
-							 best_distance);
-
-			if (target_node == NUMA_NO_NODE)
-				break;
-
-			if (best_distance == -1)
-				best_distance = node_distance(node, target_node);
-
-			/*
-			 * Visit targets from this pass in the next pass.
-			 * Eventually, every node will have been part of
-			 * a pass, and will become set in 'used_targets'.
-			 */
-			node_set(target_node, next_pass);
-		} while (1);
-	}
-	/*
-	 * 'next_pass' contains nodes which became migration
-	 * targets in this pass.  Make additional passes until
-	 * no more migrations targets are available.
-	 */
-	if (!nodes_empty(next_pass))
-		goto again;
-}
-
-/*
- * For callers that do not hold get_online_mems() already.
- */
-void set_migration_target_nodes(void)
-{
-	get_online_mems();
-	__set_migration_target_nodes();
-	put_online_mems();
-}
-
-/*
- * This leaves migrate-on-reclaim transiently disabled between
- * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
- * whether reclaim-based migration is enabled or not, which
- * ensures that the user can turn reclaim-based migration at
- * any time without needing to recalculate migration targets.
- *
- * These callbacks already hold get_online_mems().  That is why
- * __set_migration_target_nodes() can be used as opposed to
- * set_migration_target_nodes().
- */
-#ifdef CONFIG_MEMORY_HOTPLUG
-static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
-						 unsigned long action, void *_arg)
-{
-	struct memory_notify *arg = _arg;
-
-	/*
-	 * Only update the node migration order when a node is
-	 * changing status, like online->offline.  This avoids
-	 * the overhead of synchronize_rcu() in most cases.
-	 */
-	if (arg->status_change_nid < 0)
-		return notifier_from_errno(0);
-
-	switch (action) {
-	case MEM_GOING_OFFLINE:
-		/*
-		 * Make sure there are not transient states where
-		 * an offline node is a migration target.  This
-		 * will leave migration disabled until the offline
-		 * completes and the MEM_OFFLINE case below runs.
-		 */
-		disable_all_migrate_targets();
-		break;
-	case MEM_OFFLINE:
-	case MEM_ONLINE:
-		/*
-		 * Recalculate the target nodes once the node
-		 * reaches its final state (online or offline).
-		 */
-		__set_migration_target_nodes();
-		break;
-	case MEM_CANCEL_OFFLINE:
-		/*
-		 * MEM_GOING_OFFLINE disabled all the migration
-		 * targets.  Reenable them.
-		 */
-		__set_migration_target_nodes();
-		break;
-	case MEM_GOING_ONLINE:
-	case MEM_CANCEL_ONLINE:
-		break;
-	}
-
-	return notifier_from_errno(0);
-}
-#endif
-
-void __init migrate_on_reclaim_init(void)
-{
-	node_demotion = kcalloc(nr_node_ids,
-				sizeof(struct demotion_nodes),
-				GFP_KERNEL);
-	WARN_ON(!node_demotion);
-#ifdef CONFIG_MEMORY_HOTPLUG
-	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
-#endif
-	/*
-	 * At this point, all numa nodes with memory/CPus have their state
-	 * properly set, so we can build the demotion order now.
-	 * Let us hold the cpu_hotplug lock just, as we could possibily have
-	 * CPU hotplug events during boot.
-	 */
-	cpus_read_lock();
-	set_migration_target_nodes();
-	cpus_read_unlock();
-}
 #endif /* CONFIG_NUMA */
-
-
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 373d2730fcf2..35c6ff97cf29 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -28,7 +28,6 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
-#include <linux/migrate.h>
 
 #include "internal.h"
 
@@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu)
 
 	if (!node_state(cpu_to_node(cpu), N_CPU)) {
 		node_set_state(cpu_to_node(cpu), N_CPU);
-		set_migration_target_nodes();
 	}
 
 	return 0;
@@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu)
 		return 0;
 
 	node_clear_state(node, N_CPU);
-	set_migration_target_nodes();
 
 	return 0;
 }
@@ -2118,7 +2115,6 @@ void __init init_mm_internals(void)
 
 	start_shepherd_timer();
 #endif
-	migrate_on_reclaim_init();
 #ifdef CONFIG_PROC_FS
 	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
 	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
-- 
2.37.1



* [PATCH v11 6/8] mm/demotion: Add pg_data_t member to track node memory tier details
From: Aneesh Kumar K.V @ 2022-07-28 19:04 UTC
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

Also update different helpers to use NODE_DATA()->memtier. Since the
node-specific memtier can change based on the reassignment of a
NUMA node to a different memory tier, accessing NODE_DATA()->memtier
needs to happen under an RCU read lock or memory_tier_lock.
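
As a minimal sketch of the read side (node_memtier_adistance() is a
hypothetical helper shown only for illustration, not part of this
series), a reader that does not hold memory_tier_lock would do:

	static int node_memtier_adistance(int node)
	{
		pg_data_t *pgdat = NODE_DATA(node);
		struct memory_tier *memtier;
		int adistance = -1;

		if (!pgdat)
			return adistance;

		rcu_read_lock();
		memtier = rcu_dereference(pgdat->memtier);
		if (memtier)
			/* copy out what we need while still protected */
			adistance = memtier->adistance_start;
		rcu_read_unlock();

		return adistance;
	}

The update side pairs rcu_assign_pointer() with synchronize_rcu()
before the old memtier can be destroyed, as clear_node_memory_tier()
does below.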

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/mmzone.h |  3 +++
 mm/memory-tiers.c      | 40 +++++++++++++++++++++++++++++++++++-----
 2 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aab70355d64f..353812495a70 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -928,6 +928,9 @@ typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
+#ifdef CONFIG_NUMA
+	struct memory_tier __rcu *memtier;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 60845aa74afc..f982ca6b3559 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -3,6 +3,7 @@
 #include <linux/lockdep.h>
 #include <linux/memory.h>
 #include <linux/random.h>
+#include <linux/mmzone.h>
 #include <linux/memory-tiers.h>
 
 #include "internal.h"
@@ -142,12 +143,18 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
 
 static struct memory_tier *__node_get_memory_tier(int node)
 {
-	struct memory_dev_type *memtype;
+	pg_data_t *pgdat;
 
-	memtype = node_memory_types[node];
-	if (memtype)
-		return memtype->memtier;
-	return NULL;
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return NULL;
+	/*
+	 * Since we hold memory_tier_lock, we can avoid
+	 * RCU read locks when accessing the details. No
+	 * parallel updates are possible here.
+	 */
+	return rcu_dereference_check(pgdat->memtier,
+				     lockdep_is_held(&memory_tier_lock));
 }
 
 #ifdef CONFIG_MIGRATION
@@ -290,6 +297,7 @@ static inline void establish_demotion_targets(void) {}
 
 static void init_node_memory_tier(int node)
 {
+	pg_data_t *pgdat = NODE_DATA(node);
 	struct memory_tier *memtier;
 
 	mutex_lock(&memory_tier_lock);
@@ -311,8 +319,12 @@ static void init_node_memory_tier(int node)
 		}
 		memtype = node_memory_types[node];
 		memtier = find_create_memory_tier(memtype);
+		if (IS_ERR(memtier))
+			goto err_out;
+		rcu_assign_pointer(pgdat->memtier, memtier);
 	}
 	establish_demotion_targets();
+err_out:
 	mutex_unlock(&memory_tier_lock);
 }
 
@@ -324,13 +336,26 @@ static void destroy_memory_tier(struct memory_tier *memtier)
 
 static void clear_node_memory_tier(int node)
 {
+	pg_data_t *pgdat;
 	struct memory_tier *current_memtier;
 
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return;
+
 	mutex_lock(&memory_tier_lock);
+	/*
+	 * Make sure that anybody looking at NODE_DATA who finds
+	 * a valid memtier finds memory_dev_types with nodes still
+	 * linked to the memtier. We achieve this by waiting for
+	 * the RCU read section to finish using synchronize_rcu().
+	 */
 	current_memtier = __node_get_memory_tier(node);
 	if (current_memtier) {
 		struct memory_dev_type *memtype;
 
+		rcu_assign_pointer(pgdat->memtier, NULL);
+		synchronize_rcu();
 		memtype = node_memory_types[node];
 		node_clear(node, memtype->nodes);
 		if (nodes_empty(memtype->nodes)) {
@@ -386,6 +411,7 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
 
 static int __init memory_tier_init(void)
 {
+	int node;
 	struct memory_tier *memtier;
 
 	mutex_lock(&memory_tier_lock);
@@ -396,6 +422,10 @@ static int __init memory_tier_init(void)
 	if (IS_ERR(memtier))
 		panic("%s() failed to register memory tier: %ld\n",
 		      __func__, PTR_ERR(memtier));
+
+	for_each_node_state(node, N_MEMORY)
+		rcu_assign_pointer(NODE_DATA(node)->memtier, memtier);
+
 	mutex_unlock(&memory_tier_lock);
 #ifdef CONFIG_MIGRATION
 	node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v11 7/8] mm/demotion: Demote pages according to allocation fallback order
  2022-07-28 19:04 [PATCH v11 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2022-07-28 19:04 ` [PATCH v11 6/8] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
@ 2022-07-28 19:04 ` Aneesh Kumar K.V
  2022-07-28 19:04 ` [PATCH v11 8/8] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
  2022-07-29  5:30 ` [PATCH v11 0/8] mm/demotion: Memory tiers and demotion Huang, Ying
  8 siblings, 0 replies; 36+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-28 19:04 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya,
	Aneesh Kumar K . V

From: Jagdish Gediya <jvgediya.oss@gmail.com>

Currently, a higher tier node can only be demoted to selected
nodes on the next lower tier as defined by the demotion path.
This strict, hard-coded demotion order does not work in all
use cases (e.g. some use cases may want to allow cross-socket
demotion to another node in the same demotion tier as a fallback
when the preferred demotion node is out of space). This demotion
order is also inconsistent with the page allocation fallback order
when all the nodes in a higher tier are out of space: The page
allocation can fall back to any node from any lower tier, whereas
the demotion order doesn't allow that currently.

This patch adds support to get all the allowed demotion targets
for a memory tier. The demote_page_list() function is now modified
to use this allowed node mask as the fallback allocation mask.
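
As an illustrative userspace sketch (the tier layout and node numbers
here are hypothetical), the per-tier allowed-targets mask can be built
the way establish_demotion_targets() builds lower_tier_mask: walk the
tiers from fastest to slowest and strip each tier's nodes from the set
of all memory nodes, so each tier is left with only the nodes strictly
below it:

	#include <stdio.h>

	int main(void)
	{
		/* index 0 is the fastest tier; nodes encoded as a bitmask */
		unsigned int tier_nodes[] = {
			0x3,	/* tier 0: DRAM nodes 0-1 */
			0x4,	/* tier 1: PMEM node 2 */
			0x8,	/* tier 2: slower node 3 */
		};
		unsigned int lower = 0xf;	/* all N_MEMORY nodes */

		for (int i = 0; i < 3; i++) {
			lower &= ~tier_nodes[i];
			printf("tier %d lower_tier_mask = 0x%x\n", i, lower);
		}
		return 0;
	}

demote_page_list() then retries its allocation with this mask once a
__GFP_THISNODE attempt on the preferred demotion node fails.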

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h | 12 ++++++++
 mm/memory-tiers.c            | 50 +++++++++++++++++++++++++++++--
 mm/vmscan.c                  | 58 ++++++++++++++++++++++++++----------
 3 files changed, 102 insertions(+), 18 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index e56a57c6ef78..f8dbeda617a7 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -4,6 +4,7 @@
 
 #include <linux/types.h>
 #include <linux/nodemask.h>
+#include <linux/mmzone.h>
 /*
  * Each tier covers an abstract distance chunk size of 128
  */
@@ -33,11 +34,17 @@ extern bool numa_demotion_enabled;
 struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *default_type);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
 #else
 static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif
 
 #else
@@ -52,5 +59,10 @@ static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index f982ca6b3559..84e2be31a853 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -3,7 +3,6 @@
 #include <linux/lockdep.h>
 #include <linux/memory.h>
 #include <linux/random.h>
-#include <linux/mmzone.h>
 #include <linux/memory-tiers.h>
 
 #include "internal.h"
@@ -19,6 +18,8 @@ struct memory_tier {
 	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
 	 */
 	int adistance_start;
+	/* All the nodes that are part of all the lower memory tiers. */
+	nodemask_t lower_tier_mask;
 };
 
 struct demotion_nodes {
@@ -158,6 +159,24 @@ static struct memory_tier *__node_get_memory_tier(int node)
 }
 
 #ifdef CONFIG_MIGRATION
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * pg_data_t.memtier updates include a synchronize_rcu()
+	 * which ensures that we either find NULL or a valid memtier
+	 * in NODE_DATA. Protect the access via rcu_read_lock().
+	 */
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (memtier)
+		*targets = memtier->lower_tier_mask;
+	else
+		*targets = NODE_MASK_NONE;
+	rcu_read_unlock();
+}
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
@@ -205,10 +224,18 @@ int next_demotion_node(int node)
 
 static void disable_all_demotion_targets(void)
 {
+	struct memory_tier *memtier;
 	int node;
 
-	for_each_node_state(node, N_MEMORY)
+	for_each_node_state(node, N_MEMORY) {
 		node_demotion[node].preferred = NODE_MASK_NONE;
+		/*
+		 * We are holding memory_tier_lock, so it is safe
+		 * to access pgdat->memtier.
+		 */
+		memtier = __node_get_memory_tier(node);
+		memtier->lower_tier_mask = NODE_MASK_NONE;
+	}
 	/*
 	 * Ensure that the "disable" is visible across the system.
 	 * Readers will see either a combination of before+disable
@@ -240,7 +267,7 @@ static void establish_demotion_targets(void)
 	struct demotion_nodes *nd;
 	int target = NUMA_NO_NODE, node;
 	int distance, best_distance;
-	nodemask_t tier_nodes;
+	nodemask_t tier_nodes, lower_tier;
 
 	lockdep_assert_held_once(&memory_tier_lock);
 
@@ -288,6 +315,23 @@ static void establish_demotion_targets(void)
 			}
 		} while (1);
 	}
+	/*
+	 * Now build the lower_tier mask for each node, collecting the node
+	 * mask from all memory tiers below it. This allows us to fall back
+	 * demotion page allocation to a set of nodes that is closer to the
+	 * above selected preferred node.
+	 */
+	lower_tier = node_states[N_MEMORY];
+	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
+		/*
+		 * Keep removing the current tier from the lower_tier nodes.
+		 * This removes all nodes in the current and above memory
+		 * tiers from the lower_tier mask.
+		 */
+		tier_nodes = get_memtier_nodemask(memtier);
+		nodes_andnot(lower_tier, lower_tier, tier_nodes);
+		memtier->lower_tier_mask = lower_tier;
+	}
 }
 
 #else
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a8f78277f99..303c93958371 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1460,21 +1460,34 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-static struct page *alloc_demote_page(struct page *page, unsigned long node)
+static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
-	struct migration_target_control mtc = {
-		/*
-		 * Allocate from 'node', or fail quickly and quietly.
-		 * When this happens, 'page' will likely just be discarded
-		 * instead of migrated.
-		 */
-		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
-			    __GFP_THISNODE  | __GFP_NOWARN |
-			    __GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = node
-	};
+	struct page *target_page;
+	nodemask_t *allowed_mask;
+	struct migration_target_control *mtc;
+
+	mtc = (struct migration_target_control *)private;
+
+	allowed_mask = mtc->nmask;
+	/*
+	 * Make sure we allocate from the target node first, also triggering
+	 * demotion or reclaim of pages on the target node via kswapd if we
+	 * are low on free memory there. If we don't do this and we have free
+	 * memory on the slower (lower) memtier, we would start allocating
+	 * pages from slower (lower) memory tiers without even forcing a
+	 * demotion of cold pages from the target memtier. This can result
+	 * in the kernel placing hot pages in slower (lower) memory tiers.
+	 */
+	mtc->nmask = NULL;
+	mtc->gfp_mask |= __GFP_THISNODE;
+	target_page = alloc_migration_target(page, (unsigned long)mtc);
+	if (target_page)
+		return target_page;
 
-	return alloc_migration_target(page, (unsigned long)&mtc);
+	mtc->gfp_mask &= ~__GFP_THISNODE;
+	mtc->nmask = allowed_mask;
+
+	return alloc_migration_target(page, (unsigned long)mtc);
 }
 
 /*
@@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
+	nodemask_t allowed_mask;
+
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from the target node, or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
+			__GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};
 
 	if (list_empty(demote_pages))
 		return 0;
@@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	if (target_nid == NUMA_NO_NODE)
 		return 0;
 
+	node_get_allowed_targets(pgdat, &allowed_mask);
+
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_pages, alloc_demote_page, NULL,
-			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
-			    &nr_succeeded);
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+		      &nr_succeeded);
 
 	if (current_is_kswapd())
 		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v11 8/8] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-07-28 19:04 [PATCH v11 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (6 preceding siblings ...)
  2022-07-28 19:04 ` [PATCH v11 7/8] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
@ 2022-07-28 19:04 ` Aneesh Kumar K.V
  2022-07-29  6:39   ` Huang, Ying
  2022-07-29  5:30 ` [PATCH v11 0/8] mm/demotion: Memory tiers and demotion Huang, Ying
  8 siblings, 1 reply; 36+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-28 19:04 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

With memory tiers support we can have memory-only NUMA nodes in
the top tier, and we want to avoid tracking NUMA faults for
promotion from such nodes. Update node_is_toptier to work with
memory tiers. All NUMA nodes are top tier nodes by default. Once
lower memory tiers are added, we consider all memory tiers at or
above the tier containing CPU NUMA nodes to be top memory tiers.
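
For illustration (a userspace sketch with hypothetical abstract
distance values, using the v11 convention in this series where a
larger abstract distance means a faster tier):

	#include <stdio.h>
	#include <stdbool.h>

	/* adistance of the first tier that contains CPU nodes */
	static int top_tier_adistance = 512;

	static bool tier_is_top(int adistance_start)
	{
		return adistance_start >= top_tier_adistance;
	}

	int main(void)
	{
		printf("HBM  (640): %d\n", tier_is_top(640)); /* 1: above the CPU tier */
		printf("DRAM (512): %d\n", tier_is_top(512)); /* 1: the CPU tier itself */
		printf("PMEM (128): %d\n", tier_is_top(128)); /* 0: below the CPU tier */
		return 0;
	}

node_is_toptier() in this patch makes the same comparison against the
node's tier, read from pgdat->memtier under rcu_read_lock().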

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h | 11 ++++++++++
 include/linux/node.h         |  5 -----
 mm/huge_memory.c             |  1 +
 mm/memory-tiers.c            | 42 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 |  1 +
 mm/mprotect.c                |  1 +
 6 files changed, 56 insertions(+), 5 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index f8dbeda617a7..bc9fb9d39b2c 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -35,6 +35,7 @@ struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
+bool node_is_toptier(int node);
 #else
 static inline int next_demotion_node(int node)
 {
@@ -45,6 +46,11 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+	return true;
+}
 #endif
 
 #else
@@ -64,5 +70,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+	return true;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/node.h b/include/linux/node.h
index 40d641a8bfb0..9ec680dd607f 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
 
 #define to_node(device) container_of(device, struct node, dev)
 
-static inline bool node_is_toptier(int node)
-{
-	return node_state(node, N_CPU);
-}
-
 #endif /* _LINUX_NODE_H_ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 834f288b3769..8405662646e9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -35,6 +35,7 @@
 #include <linux/numa.h>
 #include <linux/page_owner.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 84e2be31a853..36d87dc422ab 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -30,6 +30,7 @@ static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
 struct memory_dev_type *node_memory_types[MAX_NUMNODES];
 #ifdef CONFIG_MIGRATION
+static int top_tier_adistance;
 /*
  * node_demotion[] examples:
  *
@@ -159,6 +160,31 @@ static struct memory_tier *__node_get_memory_tier(int node)
 }
 
 #ifdef CONFIG_MIGRATION
+bool node_is_toptier(int node)
+{
+	bool toptier;
+	pg_data_t *pgdat;
+	struct memory_tier *memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return false;
+
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (!memtier) {
+		toptier = true;
+		goto out;
+	}
+	if (memtier->adistance_start >= top_tier_adistance)
+		toptier = true;
+	else
+		toptier = false;
+out:
+	rcu_read_unlock();
+	return toptier;
+}
+
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 {
 	struct memory_tier *memtier;
@@ -315,6 +341,22 @@ static void establish_demotion_targets(void)
 			}
 		} while (1);
 	}
+	/*
+	 * Promotion is allowed from a memory tier to a higher
+	 * memory tier only if the lower memory tier doesn't include
+	 * compute. We want to skip promotion from a memory tier
+	 * if any node that is part of that memory tier has CPUs.
+	 * Once we detect such a memory tier, we consider that tier
+	 * as the top tier from which promotion is not allowed.
+	 */
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		tier_nodes = get_memtier_nodemask(memtier);
+		nodes_and(tier_nodes, node_states[N_CPU], tier_nodes);
+		if (!nodes_empty(tier_nodes)) {
+			top_tier_adistance = memtier->adistance_start;
+			break;
+		}
+	}
 	/*
 	 * Now build the lower_tier mask for each node collecting node mask from
 	 * all memory tier below it. This allows us to fallback demotion page
diff --git a/mm/migrate.c b/mm/migrate.c
index c758c9c21d7d..1da81136eaaa 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@
 #include <linux/memory.h>
 #include <linux/random.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..92a2fc0fa88b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -31,6 +31,7 @@
 #include <linux/pgtable.h>
 #include <linux/sched/sysctl.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/memory-tiers.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 0/8] mm/demotion: Memory tiers and demotion
  2022-07-28 19:04 [PATCH v11 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (7 preceding siblings ...)
  2022-07-28 19:04 ` [PATCH v11 8/8] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
@ 2022-07-29  5:30 ` Huang, Ying
  2022-07-29  6:17   ` Aneesh Kumar K.V
  8 siblings, 1 reply; 36+ messages in thread
From: Huang, Ying @ 2022-07-29  5:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> The current kernel has the basic memory tiering support: Inactive pages on a
> higher tier NUMA node can be migrated (demoted) to a lower tier NUMA node to
> make room for new allocations on the higher tier NUMA node. Frequently accessed
> pages on a lower tier NUMA node can be migrated (promoted) to a higher tier NUMA
> node to improve the performance.
>
> In the current kernel, memory tiers are defined implicitly via a demotion path
> relationship between NUMA nodes, which is created during the kernel
> initialization and updated when a NUMA node is hot-added or hot-removed. The
> current implementation puts all nodes with CPU into the top tier, and builds the
> tier hierarchy tier-by-tier by establishing the per-node demotion targets based
> on the distances between nodes.
>
> This current memory tier kernel interface needs to be improved for several
> important use cases:
>
> * The current tier initialization code always initializes each memory-only NUMA
>   node into a lower tier. But a memory-only NUMA node may have a high
>   performance memory device (e.g. a DRAM device attached via CXL.mem or a
>   DRAM-backed memory-only node on a virtual machine) and should be put into a
>   higher tier.
>
> * The current tier hierarchy always puts CPU nodes into the top tier. But on a
>   system with HBM (e.g. GPU memory) devices, these memory-only HBM NUMA nodes
>   should be in the top tier, and DRAM nodes with CPUs are better to be placed
>   into the next lower tier.
>
> * Also because the current tier hierarchy always puts CPU nodes into the top
>   tier, when a CPU is hot-added (or hot-removed) and triggers a memory node from
>   CPU-less into a CPU node (or vice versa), the memory tier hierarchy gets
>   changed, even though no memory node is added or removed. This can make the
>   tier hierarchy unstable and make it difficult to support tier-based memory
>   accounting.
>
> * A higher tier node can only be demoted to selected nodes on the next lower
>   tier as defined by the demotion path, not any other node from any lower tier.
>   This strict, hard-coded demotion order does not work in all use cases (e.g.
>   some use cases may want to allow cross-socket demotion to another node in the
>   same demotion tier as a fallback when the preferred demotion node is out of
>   space), and has resulted in the feature request for an interface to override
>   the system-wide, per-node demotion order from the userspace. This demotion
>   order is also inconsistent with the page allocation fallback order when all
>   the nodes in a higher tier are out of space: The page allocation can fall back
>   to any node from any lower tier, whereas the demotion order doesn't allow
>   that.
>
> This patch series make the creation of memory tiers explicit under
> the control of device driver.
>
> Memory Tier Initialization
> ==========================
>
> Linux kernel presents memory devices as NUMA nodes and each memory device is of
> a specific type. The memory type of a device is represented by its abstract 
> distance. A memory tier corresponds to a range of abstract distance. This allows
> for classifying memory devices with a specific performance range into a memory
> tier.
>
> By default, all memory nodes are assigned to the default tier with
> abstract distance 512.
>
> A device driver can move its memory nodes from the default tier. For example,
> PMEM can move its memory nodes below the default tier, whereas GPU can move its
> memory nodes above the default tier.
>
> The kernel initialization code makes the decision on which exact tier a memory
> node should be assigned to based on the requests from the device drivers as well
> as the memory device hardware information provided by the firmware.
>
> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.

Some of the patch description of [0/8] is the same as that of [1/8].  It
appears that you revised [1/8] but forgot to revise [0/8] too.  Please
do that.

Best Regards,
Huang, Ying

> Changes from v10:
> * rename performance level to abstract distance
> * Thanks to all the good feedback from Huang, Ying <ying.huang@intel.com>.
>   Updated the patchset to cover most of the review feedback.
>
> Changes from v9:
> * Use performance level for initializing memory tiers.
>
> Changes from v8:
> * Drop the sysfs interface patches and  related documentation changes.
>
> Changes from v7:
> * Fix kernel crash with demotion.
> * Improve documentation.
>
> Changes from v6:
> * Drop the usage of rank.
> * Address other review feedback.
>
> Changes from v5:
> * Remove patch supporting N_MEMORY node removal from memory tiers. memory tiers
>   are going to be used for features other than demotion. Hence keep all N_MEMORY
>   nodes in memory tiers irrespective of whether they want to participate in promotion or demotion.
> * Add NODE_DATA->memtier
> * Rearrage patches to add sysfs files later.
> * Add support to create memory tiers from userspace.
> * Address other review feedback.
>
>
> Changes from v4:
> * Address review feedback.
> * Reverse the meaning of "rank": higher rank value means higher tier.
> * Add "/sys/devices/system/memtier/default_tier".
> * Add node_is_toptier
>
> v4:
> Add support for explicit memory tiers and ranks.
>
> v3:
> - Modify patch 1 subject to make it more specific
> - Remove /sys/kernel/mm/numa/demotion_targets interface, use
>   /sys/devices/system/node/demotion_targets instead and make
>   it writable to override node_states[N_DEMOTION_TARGETS].
> - Add support to view per node demotion targets via sysfs
>
> v2:
> In v1, only 1st patch of this patch series was sent, which was
> implemented to avoid some of the limitations on the demotion
> target sharing, however for certain numa topology, the demotion
> targets found by that patch was not most optimal, so 1st patch
> in this series is modified according to suggestions from Huang
> and Baolin. Different examples of demotion list comparasion
> between existing implementation and changed implementation can
> be found in the commit message of 1st patch.
>
>
>
> Aneesh Kumar K.V (7):
>   mm/demotion: Add support for explicit memory tiers
>   mm/demotion: Move memory demotion related code
>   mm/demotion: Add hotplug callbacks to handle new numa node onlined
>   mm/demotion/dax/kmem: Set node's abstract distance to
>     MEMTIER_ADISTANCE_PMEM
>   mm/demotion: Build demotion targets based on explicit memory tiers
>   mm/demotion: Add pg_data_t member to track node memory tier details
>   mm/demotion: Update node_is_toptier to work with memory tiers
>
> Jagdish Gediya (1):
>   mm/demotion: Demote pages according to allocation fallback order
>
>  drivers/dax/kmem.c           |   9 +
>  include/linux/memory-tiers.h |  79 +++++
>  include/linux/migrate.h      |  15 -
>  include/linux/mmzone.h       |   3 +
>  include/linux/node.h         |   5 -
>  mm/Makefile                  |   1 +
>  mm/huge_memory.c             |   1 +
>  mm/memory-tiers.c            | 586 +++++++++++++++++++++++++++++++++++
>  mm/migrate.c                 | 453 +--------------------------
>  mm/mprotect.c                |   1 +
>  mm/vmscan.c                  |  59 +++-
>  mm/vmstat.c                  |   4 -
>  12 files changed, 725 insertions(+), 491 deletions(-)
>  create mode 100644 include/linux/memory-tiers.h
>  create mode 100644 mm/memory-tiers.c

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 0/8] mm/demotion: Memory tiers and demotion
  2022-07-29  5:30 ` [PATCH v11 0/8] mm/demotion: Memory tiers and demotion Huang, Ying
@ 2022-07-29  6:17   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 36+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-29  6:17 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Huang, Ying" <ying.huang@intel.com> writes:

> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
>> The current kernel has the basic memory tiering support: Inactive pages on a
>> higher tier NUMA node can be migrated (demoted) to a lower tier NUMA node to
>> make room for new allocations on the higher tier NUMA node. Frequently accessed
>> pages on a lower tier NUMA node can be migrated (promoted) to a higher tier NUMA
>> node to improve the performance.
>>
>> In the current kernel, memory tiers are defined implicitly via a demotion path
>> relationship between NUMA nodes, which is created during the kernel
>> initialization and updated when a NUMA node is hot-added or hot-removed. The
>> current implementation puts all nodes with CPU into the top tier, and builds the
>> tier hierarchy tier-by-tier by establishing the per-node demotion targets based
>> on the distances between nodes.
>>
>> This current memory tier kernel interface needs to be improved for several
>> important use cases:
>>
>> * The current tier initialization code always initializes each memory-only NUMA
>>   node into a lower tier. But a memory-only NUMA node may have a high
>>   performance memory device (e.g. a DRAM device attached via CXL.mem or a
>>   DRAM-backed memory-only node on a virtual machine) and should be put into a
>>   higher tier.
>>
>> * The current tier hierarchy always puts CPU nodes into the top tier. But on a
>>   system with HBM (e.g. GPU memory) devices, these memory-only HBM NUMA nodes
>>   should be in the top tier, and DRAM nodes with CPUs are better to be placed
>>   into the next lower tier.
>>
>> * Also because the current tier hierarchy always puts CPU nodes into the top
>>   tier, when a CPU is hot-added (or hot-removed) and triggers a memory node from
>>   CPU-less into a CPU node (or vice versa), the memory tier hierarchy gets
>>   changed, even though no memory node is added or removed. This can make the
>>   tier hierarchy unstable and make it difficult to support tier-based memory
>>   accounting.
>>
>> * A higher tier node can only be demoted to selected nodes on the next lower
>>   tier as defined by the demotion path, not any other node from any lower tier.
>>   This strict, hard-coded demotion order does not work in all use cases (e.g.
>>   some use cases may want to allow cross-socket demotion to another node in the
>>   same demotion tier as a fallback when the preferred demotion node is out of
>>   space), and has resulted in the feature request for an interface to override
>>   the system-wide, per-node demotion order from the userspace. This demotion
>>   order is also inconsistent with the page allocation fallback order when all
>>   the nodes in a higher tier are out of space: The page allocation can fall back
>>   to any node from any lower tier, whereas the demotion order doesn't allow
>>   that.
>>
>> This patch series make the creation of memory tiers explicit under
>> the control of device driver.
>>
>> Memory Tier Initialization
>> ==========================
>>
>> Linux kernel presents memory devices as NUMA nodes and each memory device is of
>> a specific type. The memory type of a device is represented by its abstract 
>> distance. A memory tier corresponds to a range of abstract distance. This allows
>> for classifying memory devices with a specific performance range into a memory
>> tier.
>>
>> By default, all memory nodes are assigned to the default tier with
>> abstract distance 512.
>>
>> A device driver can move its memory nodes from the default tier. For example,
>> PMEM can move its memory nodes below the default tier, whereas GPU can move its
>> memory nodes above the default tier.
>>
>> The kernel initialization code makes the decision on which exact tier a memory
>> node should be assigned to based on the requests from the device drivers as well
>> as the memory device hardware information provided by the firmware.
>>
>> Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
>
> Some patch description of [0/8] is same as that of [1/8] originally.  It
> appears that you revised [1/8], but forget to revise [0/8] too.  Please
> do that.

I just sent v12 making sure a smaller value of abstract distance implies
a faster (higher) memory tier. I missed that in v11.

-aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
  2022-07-28 19:04 ` [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM Aneesh Kumar K.V
@ 2022-07-29  6:20   ` Huang, Ying
  2022-07-29  7:19     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 36+ messages in thread
From: Huang, Ying @ 2022-07-29  6:20 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> By default, all nodes are assigned to the default memory tier, which
> is the memory tier designated for nodes with DRAM.
>
> Set the dax kmem device node's tier to a slower memory tier by assigning
> its abstract distance to MEMTIER_ADISTANCE_PMEM. The PMEM tier
> appears below the default memory tier in demotion order.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  drivers/dax/kmem.c           |  9 +++++++++
>  include/linux/memory-tiers.h | 19 ++++++++++++++++++-
>  mm/memory-tiers.c            | 28 ++++++++++++++++------------
>  3 files changed, 43 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index a37622060fff..6b0d5de9a3e9 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -11,6 +11,7 @@
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/mman.h>
> +#include <linux/memory-tiers.h>
>  #include "dax-private.h"
>  #include "bus.h"
>  
> @@ -41,6 +42,12 @@ struct dax_kmem_data {
>  	struct resource *res[];
>  };
>  
> +static struct memory_dev_type default_pmem_type  = {

Why is this named default_pmem_type?  We will usually not change the
memory type of a node.

> +	.adistance = MEMTIER_ADISTANCE_PMEM,
> +	.tier_sibiling = LIST_HEAD_INIT(default_pmem_type.tier_sibiling),
> +	.nodes  = NODE_MASK_NONE,
> +};
> +
>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  {
>  	struct device *dev = &dev_dax->dev;
> @@ -62,6 +69,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  		return -EINVAL;
>  	}
>  
> +	init_node_memory_type(numa_node, &default_pmem_type);
> +

The memory hot-add below may fail.  So error handling needs to be
added.

And, it appears that the memory type and memory tier of a node may be
fully initialized here before NUMA hot-adding has started.  So I suggest
setting node_memory_types[] here only, and setting memory_dev_type->nodes
in the node hot-add callback.  I think that is the proper place to
complete the initialization.

And, in theory dax/kmem.c can be unloaded.  So we need to clear
node_memory_types[] for its nodes somewhere.
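
A minimal sketch of the suggested split (hypothetical code, only to
illustrate the idea, not a tested change):

	/* in dev_dax_kmem_probe(): only record the node's memory type */
	node_memory_types[numa_node] = &default_pmem_type;

	/* in memtier_hotplug_callback(), on MEM_ONLINE: complete the
	 * initialization once the node actually has memory */
	node_set(arg->status_change_nid,
		 node_memory_types[arg->status_change_nid]->nodes);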

Best Regards,
Huang, Ying

>  	for (i = 0; i < dev_dax->nr_range; i++) {
>  		struct range range;
>  
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 976f43a5e3be..4f4baf0bf430 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -2,6 +2,8 @@
>  #ifndef _LINUX_MEMORY_TIERS_H
>  #define _LINUX_MEMORY_TIERS_H
>  
> +#include <linux/types.h>
> +#include <linux/nodemask.h>
>  /*
>   * Each tier covers an abstract distance chunk size of 128
>   */
> @@ -15,12 +17,27 @@
>  #define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
>  #define MEMTIER_HOTPLUG_PRIO	100
>  
> +struct memory_tier;
> +struct memory_dev_type {
> +	/* list of memory types that are part of same tier as this type */
> +	struct list_head tier_sibiling;
> +	/* abstract distance for this specific memory type */
> +	int adistance;
> +	/* Nodes of same abstract distance */
> +	nodemask_t nodes;
> +	struct memory_tier *memtier;
> +};
> +
>  #ifdef CONFIG_NUMA
> -#include <linux/types.h>
>  extern bool numa_demotion_enabled;
> +struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *default_type);
>  
>  #else
>  
>  #define numa_demotion_enabled	false
> +static inline struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *default_type)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
>  #endif	/* CONFIG_NUMA */
>  #endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index c9854a394d9b..109be75fa554 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -1,6 +1,4 @@
>  // SPDX-License-Identifier: GPL-2.0
> -#include <linux/types.h>
> -#include <linux/nodemask.h>
>  #include <linux/slab.h>
>  #include <linux/lockdep.h>
>  #include <linux/memory.h>
> @@ -19,16 +17,6 @@ struct memory_tier {
>  	int adistance_start;
>  };
>  
> -struct memory_dev_type {
> -	/* list of memory types that are part of same tier as this type */
> -	struct list_head tier_sibiling;
> -	/* abstract distance for this specific memory type */
> -	int adistance;
> -	/* Nodes of same abstract distance */
> -	nodemask_t nodes;
> -	struct memory_tier *memtier;
> -};
> -
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
>  struct memory_dev_type *node_memory_types[MAX_NUMNODES];
> @@ -141,6 +129,22 @@ static void clear_node_memory_tier(int node)
>  	mutex_unlock(&memory_tier_lock);
>  }
>  
> +struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *default_type)
> +{
> +	struct memory_dev_type *mem_type;
> +
> +	mutex_lock(&memory_tier_lock);
> +	if (node_memory_types[node]) {
> +		mem_type = node_memory_types[node];
> +	} else {
> +		node_memory_types[node] = default_type;
> +		node_set(node, default_type->nodes);
> +		mem_type = default_type;
> +	}
> +	mutex_unlock(&memory_tier_lock);
> +	return mem_type;
> +}
> +
>  static int __meminit memtier_hotplug_callback(struct notifier_block *self,
>  					      unsigned long action, void *_arg)
>  {

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-28 19:04 ` [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
@ 2022-07-29  6:25   ` Huang, Ying
  2022-07-29  7:24     ` Aneesh Kumar K.V
       [not found]   ` <62e890da7f784_577a029473@dwillia2-xfh.jf.intel.com.notmuch>
  1 sibling, 1 reply; 36+ messages in thread
From: Huang, Ying @ 2022-07-29  6:25 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> In the current kernel, memory tiers are defined implicitly via a demotion path
> relationship between NUMA nodes, which is created during the kernel
> initialization and updated when a NUMA node is hot-added or hot-removed. The
> current implementation puts all nodes with CPU into the highest tier, and builds
> the tier hierarchy tier-by-tier by establishing the per-node demotion targets
> based on the distances between nodes.
>
> This current memory tier kernel implementation needs to be improved for several
> important use cases:
>
> The current tier initialization code always initializes each memory-only NUMA
> node into a lower tier. But a memory-only NUMA node may have a high performance
> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
> should be put into a higher tier.
>
> The current tier hierarchy always puts CPU nodes into the top tier. But on a
> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices
> should be in the top tier, and DRAM nodes with CPUs are better to be placed into
> the next lower tier.
>
> With the current kernel, a higher tier node can only be demoted to nodes with the
> shortest distance on the next lower tier as defined by the demotion path, not any
> other node from any lower tier. This strict demotion order does not work in all
> use cases (e.g. some use cases may want to allow cross-socket demotion to another
> node in the same demotion tier as a fallback when the preferred demotion node is
> out of space). This demotion order is also inconsistent with the page allocation
> fallback order when all the nodes in a higher tier are out of space: The page
> allocation can fall back to any node from any lower tier, whereas the demotion
> order doesn't allow that.
>
> This patch series address the above by defining memory tiers explicitly.
>
> Linux kernel presents memory devices as NUMA nodes and each memory device is of
> a specific type. The memory type of a device is represented by its abstract
> distance. A memory tier corresponds to a range of abstract distance. This allows
> for classifying memory devices with a specific performance range into a memory
> tier.
>
> This patch configures the range/chunk size to be 128. The default DRAM
> abstract distance is 512. We can have 4 memory tiers below the default DRAM
> abstract distance, which cover the ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
> Slower memory devices like persistent memory will have abstract distance below
> the default DRAM level and hence will be placed in these 4 lower tiers.

For abstract distance, a lower value means higher performance and a
higher value means lower performance.  So the abstract distance of PMEM
should be larger than that of DRAM.
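
For example (a userspace sketch of the chunk arithmetic, using the
constants from this patch):

	#include <stdio.h>

	#define MEMTIER_CHUNK_BITS	7
	#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)

	int main(void)
	{
		int adistance[] = { 128 /* PMEM */, 512 /* DRAM */ };

		for (int i = 0; i < 2; i++) {
			int start = adistance[i] & ~(MEMTIER_CHUNK_SIZE - 1);
			printf("adistance %d -> tier range %d - %d\n",
			       adistance[i], start,
			       start + MEMTIER_CHUNK_SIZE - 1);
		}
		return 0;
	}

This mirrors the round_down() in find_create_memory_tier(): every
abstract distance maps to the start of its 128-sized chunk, and one
memory tier is created per occupied chunk.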

> A kernel parameter is provided to override the default memory tier.

Did you forget to delete this?

Best Regards,
Huang, Ying

> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h |  17 ++++++
>  mm/Makefile                  |   1 +
>  mm/memory-tiers.c            | 102 +++++++++++++++++++++++++++++++++++
>  3 files changed, 120 insertions(+)
>  create mode 100644 include/linux/memory-tiers.h
>  create mode 100644 mm/memory-tiers.c
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> new file mode 100644
> index 000000000000..8d7884b7a3f0
> --- /dev/null
> +++ b/include/linux/memory-tiers.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +/*
> + * Each tier covers an abstract distance chunk size of 128
> + */
> +#define MEMTIER_CHUNK_BITS	7
> +#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
> +/*
> + * For now let's have 4 memory tiers below the default DRAM tier.
> + */
> +#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
> +/* leave one tier below this slow pmem */
> +#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
> +
> +#endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 6f9ffa968a1a..d30acebc2164 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>  obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> new file mode 100644
> index 000000000000..01cfd514c192
> --- /dev/null
> +++ b/mm/memory-tiers.c
> @@ -0,0 +1,102 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/types.h>
> +#include <linux/nodemask.h>
> +#include <linux/slab.h>
> +#include <linux/lockdep.h>
> +#include <linux/memory-tiers.h>
> +
> +struct memory_tier {
> +	/* hierarchy of memory tiers */
> +	struct list_head list;
> +	/* list of all memory types part of this tier */
> +	struct list_head memory_types;
> +	/*
> +	 * start value of abstract distance. A memory tier maps
> +	 * an abstract distance range,
> +	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
> +	 */
> +	int adistance_start;
> +};
> +
> +struct memory_dev_type {
> +	/* list of memory types that are part of same tier as this type */
> +	struct list_head tier_sibiling;
> +	/* abstract distance for this specific memory type */
> +	int adistance;
> +	/* Nodes of same abstract distance */
> +	nodemask_t nodes;
> +	struct memory_tier *memtier;
> +};
> +
> +static DEFINE_MUTEX(memory_tier_lock);
> +static LIST_HEAD(memory_tiers);
> +struct memory_dev_type *node_memory_types[MAX_NUMNODES];
> +/*
> + * For now let's have 4 memory tiers below the default DRAM tier.
> + */
> +static struct memory_dev_type default_dram_type  = {
> +	.adistance = MEMTIER_ADISTANCE_DRAM,
> +	.tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
> +};
> +
> +static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
> +{
> +	bool found_slot = false;
> +	struct memory_tier *memtier, *new_memtier;
> +	int adistance = memtype->adistance;
> +	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
> +
> +	lockdep_assert_held_once(&memory_tier_lock);
> +
> +	/*
> +	 * If the memtype is already part of a memory tier,
> +	 * just return that.
> +	 */
> +	if (memtype->memtier)
> +		return memtype->memtier;
> +
> +	adistance = round_down(adistance, memtier_adistance_chunk_size);
> +	list_for_each_entry(memtier, &memory_tiers, list) {
> +		if (adistance == memtier->adistance_start) {
> +			memtype->memtier = memtier;
> +			list_add(&memtype->tier_sibiling, &memtier->memory_types);
> +			return memtier;
> +		} else if (adistance < memtier->adistance_start) {
> +			found_slot = true;
> +			break;
> +		}
> +	}
> +
> +	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> +	if (!new_memtier)
> +		return ERR_PTR(-ENOMEM);
> +
> +	new_memtier->adistance_start = adistance;
> +	INIT_LIST_HEAD(&new_memtier->list);
> +	INIT_LIST_HEAD(&new_memtier->memory_types);
> +	if (found_slot)
> +		list_add_tail(&new_memtier->list, &memtier->list);
> +	else
> +		list_add_tail(&new_memtier->list, &memory_tiers);
> +	memtype->memtier = new_memtier;
> +	list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
> +	return new_memtier;
> +}
> +
> +static int __init memory_tier_init(void)
> +{
> +	struct memory_tier *memtier;
> +
> +	mutex_lock(&memory_tier_lock);
> +	/* CPU only nodes are not part of memory tiers. */
> +	default_dram_type.nodes = node_states[N_MEMORY];
> +
> +	memtier = find_create_memory_tier(&default_dram_type);
> +	if (IS_ERR(memtier))
> +		panic("%s() failed to register memory tier: %ld\n",
> +		      __func__, PTR_ERR(memtier));
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return 0;
> +}
> +subsys_initcall(memory_tier_init);

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-28 19:04 ` [PATCH v11 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
@ 2022-07-29  6:35   ` Huang, Ying
  2022-07-29  7:22     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 36+ messages in thread
From: Huang, Ying @ 2022-07-29  6:35 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> This patch switches the demotion target building logic to use memory tiers
> instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
> default memory tier and additional memory tiers will be added by drivers like
> dax kmem.
>
> This patch builds the demotion target for a NUMA node by looking at all
> memory tiers below the tier to which the NUMA node belongs. The closest node
> in the immediately following memory tier is used as a demotion target.
>
> Since we are now only building demotion targets for N_MEMORY NUMA nodes
> the CPU hotplug calls are removed in this patch.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h |  13 ++
>  include/linux/migrate.h      |  13 --
>  mm/memory-tiers.c            | 221 +++++++++++++++++++-
>  mm/migrate.c                 | 394 -----------------------------------
>  mm/vmstat.c                  |   4 -
>  5 files changed, 233 insertions(+), 412 deletions(-)
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 4f4baf0bf430..e56a57c6ef78 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -31,6 +31,14 @@ struct memory_dev_type {
>  #ifdef CONFIG_NUMA
>  extern bool numa_demotion_enabled;
>  struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *default_type);
> +#ifdef CONFIG_MIGRATION
> +int next_demotion_node(int node);
> +#else
> +static inline int next_demotion_node(int node)
> +{
> +	return NUMA_NO_NODE;
> +}
> +#endif
>  
>  #else
>  
> @@ -39,5 +47,10 @@ static inline struct memory_dev_type *init_node_memory_type(int node, struct mem
>  {
>  	return ERR_PTR(-EINVAL);
>  }
> +
> +static inline int next_demotion_node(int node)
> +{
> +	return NUMA_NO_NODE;
> +}
>  #endif	/* CONFIG_NUMA */
>  #endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 43e737215f33..93fab62e6548 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
>  
>  #endif /* CONFIG_MIGRATION */
>  
> -#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
> -extern void set_migration_target_nodes(void);
> -extern void migrate_on_reclaim_init(void);
> -extern int next_demotion_node(int node);
> -#else
> -static inline void set_migration_target_nodes(void) {}
> -static inline void migrate_on_reclaim_init(void) {}
> -static inline int next_demotion_node(int node)
> -{
> -        return NUMA_NO_NODE;
> -}
> -#endif
> -
>  #ifdef CONFIG_COMPACTION
>  extern int PageMovable(struct page *page);
>  extern void __SetPageMovable(struct page *page, struct address_space *mapping);
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 109be75fa554..60845aa74afc 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -2,8 +2,11 @@
>  #include <linux/slab.h>
>  #include <linux/lockdep.h>
>  #include <linux/memory.h>
> +#include <linux/random.h>
>  #include <linux/memory-tiers.h>
>  
> +#include "internal.h"
> +
>  struct memory_tier {
>  	/* hierarchy of memory tiers */
>  	struct list_head list;
> @@ -17,9 +20,74 @@ struct memory_tier {
>  	int adistance_start;
>  };
>  
> +struct demotion_nodes {
> +	nodemask_t preferred;
> +};
> +
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
>  struct memory_dev_type *node_memory_types[MAX_NUMNODES];
> +#ifdef CONFIG_MIGRATION
> +/*
> + * node_demotion[] examples:
> + *
> + * Example 1:
> + *
> + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
> + *
> + * node distances:
> + * node   0    1    2    3
> + *    0  10   20   30   40
> + *    1  20   10   40   30
> + *    2  30   40   10   40
> + *    3  40   30   40   10
> + *
> + * memory_tiers0 = 0-1
> + * memory_tiers1 = 2-3
> + *
> + * node_demotion[0].preferred = 2
> + * node_demotion[1].preferred = 3
> + * node_demotion[2].preferred = <empty>
> + * node_demotion[3].preferred = <empty>
> + *
> + * Example 2:
> + *
> + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
> + *
> + * node distances:
> + * node   0    1    2
> + *    0  10   20   30
> + *    1  20   10   30
> + *    2  30   30   10
> + *
> + * memory_tiers0 = 0-2
> + *
> + * node_demotion[0].preferred = <empty>
> + * node_demotion[1].preferred = <empty>
> + * node_demotion[2].preferred = <empty>
> + *
> + * Example 3:
> + *
> + * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
> + *
> + * node distances:
> + * node   0    1    2
> + *    0  10   20   30
> + *    1  20   10   40
> + *    2  30   40   10
> + *
> + * memory_tiers0 = 1
> + * memory_tiers1 = 0
> + * memory_tiers2 = 2
> + *
> + * node_demotion[0].preferred = 2
> + * node_demotion[1].preferred = 0
> + * node_demotion[2].preferred = <empty>
> + *
> + */
> +static struct demotion_nodes *node_demotion __read_mostly;
> +#endif /* CONFIG_MIGRATION */
> +
>  /*
>   * For now let's have 4 memory tiers below the default DRAM tier.
>   */
> @@ -82,6 +150,144 @@ static struct memory_tier *__node_get_memory_tier(int node)
>  	return NULL;
>  }
>  
> +#ifdef CONFIG_MIGRATION
> +/**
> + * next_demotion_node() - Get the next node in the demotion path
> + * @node: The starting node to lookup the next node
> + *
> + * Return: node id for next memory node in the demotion path hierarchy
> + * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
> + * @node online or guarantee that it *continues* to be the next demotion
> + * target.
> + */
> +int next_demotion_node(int node)
> +{
> +	struct demotion_nodes *nd;
> +	int target;
> +
> +	if (!node_demotion)
> +		return NUMA_NO_NODE;
> +
> +	nd = &node_demotion[node];
> +
> +	/*
> +	 * node_demotion[] is updated without excluding this
> +	 * function from running.
> +	 *
> +	 * Make sure to use RCU over entire code blocks if
> +	 * node_demotion[] reads need to be consistent.
> +	 */
> +	rcu_read_lock();
> +	/*
> +	 * If there are multiple target nodes, just select one
> +	 * target node randomly.
> +	 *
> +	 * In addition, we could also use round-robin to select the
> +	 * target node, but we would need another variable in
> +	 * node_demotion[] to record the last selected target node,
> +	 * which may cause cache ping-pong due to that field changing.
> +	 * Introducing per-cpu data to avoid the caching issue seems
> +	 * more complicated. So selecting the target node randomly
> +	 * seems better for now.
> +	 */
> +	target = node_random(&nd->preferred);

I don't find code to optimize node_random() for the weight == 1 case.
Did you forget to do that?
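
A minimal sketch of the kind of fast path meant here (a hypothetical
wrapper around the existing helpers, not merged code):

	static int node_random_fast(const nodemask_t *maskp)
	{
		int w = nodes_weight(*maskp);

		if (w == 0)
			return NUMA_NO_NODE;
		if (w == 1)
			/* single candidate: skip the random pick */
			return first_node(*maskp);
		return node_random(maskp);
	}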

Best Regards,
Huang, Ying

> +	rcu_read_unlock();
> +
> +	return target;
> +}
> +
> +static void disable_all_demotion_targets(void)
> +{
> +	int node;
> +
> +	for_each_node_state(node, N_MEMORY)
> +		node_demotion[node].preferred = NODE_MASK_NONE;
> +	/*
> +	 * Ensure that the "disable" is visible across the system.
> +	 * Readers will see either a combination of before+disable
> +	 * state or disable+after.  They will never see before and
> +	 * after state together.
> +	 */
> +	synchronize_rcu();
> +}
> +
> +static __always_inline nodemask_t get_memtier_nodemask(struct memory_tier *memtier)
> +{
> +	nodemask_t nodes = NODE_MASK_NONE;
> +	struct memory_dev_type *memtype;
> +
> +	list_for_each_entry(memtype, &memtier->memory_types, tier_sibiling)
> +		nodes_or(nodes, nodes, memtype->nodes);
> +
> +	return nodes;
> +}
> +
> +/*
> + * Find an automatic demotion target for all memory
> + * nodes. Failing here is OK.  It might just indicate
> + * being at the end of a chain.
> + */
> +static void establish_demotion_targets(void)
> +{
> +	struct memory_tier *memtier;
> +	struct demotion_nodes *nd;
> +	int target = NUMA_NO_NODE, node;
> +	int distance, best_distance;
> +	nodemask_t tier_nodes;
> +
> +	lockdep_assert_held_once(&memory_tier_lock);
> +
> +	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
> +		return;
> +
> +	disable_all_demotion_targets();
> +
> +	for_each_node_state(node, N_MEMORY) {
> +		best_distance = -1;
> +		nd = &node_demotion[node];
> +
> +		memtier = __node_get_memory_tier(node);
> +		if (!memtier || list_is_first(&memtier->list, &memory_tiers))
> +			continue;
> +		/*
> +		 * Get the lower memtier to find the demotion node list.
> +		 */
> +		memtier = list_prev_entry(memtier, list);
> +		tier_nodes = get_memtier_nodemask(memtier);
> +		/*
> +		 * find_next_best_node() uses the 'used' nodemask as a skip list.
> +		 * Add all memory nodes except the selected memory tier
> +		 * nodelist to the skip list so that we find the best node from
> +		 * the memtier nodelist.
> +		 */
> +		nodes_andnot(tier_nodes, node_states[N_MEMORY], tier_nodes);
> +
> +		/*
> +		 * Find all the nodes in the memory tier node list with the
> +		 * same best distance and add them to the preferred mask. We
> +		 * randomly select between nodes in the preferred mask when
> +		 * allocating pages during demotion.
> +		 */
> +		do {
> +			target = find_next_best_node(node, &tier_nodes);
> +			if (target == NUMA_NO_NODE)
> +				break;
> +
> +			distance = node_distance(node, target);
> +			if (distance == best_distance || best_distance == -1) {
> +				best_distance = distance;
> +				node_set(target, nd->preferred);
> +			} else {
> +				break;
> +			}
> +		} while (1);
> +	}
> +}
> +
> +#else
> +static inline void disable_all_demotion_targets(void) {}
> +static inline void establish_demotion_targets(void) {}
> +#endif /* CONFIG_MIGRATION */
> +
>  static void init_node_memory_tier(int node)
>  {
>  	struct memory_tier *memtier;
> @@ -89,6 +295,13 @@ static void init_node_memory_tier(int node)
>  	mutex_lock(&memory_tier_lock);
>  
>  	memtier = __node_get_memory_tier(node);
> +	/*
> +	 * If the node is already part of a tier, proceed with the
> +	 * current tier value, because we might want to establish
> +	 * new migration paths now. The node might have been added to a
> +	 * tier before it was made part of N_MEMORY, hence
> +	 * establish_demotion_targets will have skipped this node.
> +	 */
>  	if (!memtier) {
>  		struct memory_dev_type *memtype;
>  
> @@ -99,6 +312,7 @@ static void init_node_memory_tier(int node)
>  		memtype = node_memory_types[node];
>  		memtier = find_create_memory_tier(memtype);
>  	}
> +	establish_demotion_targets();
>  	mutex_unlock(&memory_tier_lock);
>  }
>  
> @@ -125,6 +339,7 @@ static void clear_node_memory_tier(int node)
>  			if (list_empty(&current_memtier->memory_types))
>  				destroy_memory_tier(current_memtier);
>  		}
> +		establish_demotion_targets();
>  	}
>  	mutex_unlock(&memory_tier_lock);
>  }
> @@ -182,7 +397,11 @@ static int __init memory_tier_init(void)
>  		panic("%s() failed to register memory tier: %ld\n",
>  		      __func__, PTR_ERR(memtier));
>  	mutex_unlock(&memory_tier_lock);
> -
> +#ifdef CONFIG_MIGRATION
> +	node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
> +				GFP_KERNEL);
> +	WARN_ON(!node_demotion);
> +#endif
>  	hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRIO);
>  	return 0;
>  }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index fce7d4a9e940..c758c9c21d7d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2117,398 +2117,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>  	return 0;
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
> -
> -/*
> - * node_demotion[] example:
> - *
> - * Consider a system with two sockets.  Each socket has
> - * three classes of memory attached: fast, medium and slow.
> - * Each memory class is placed in its own NUMA node.  The
> - * CPUs are placed in the node with the "fast" memory.  The
> - * 6 NUMA nodes (0-5) might be split among the sockets like
> - * this:
> - *
> - *	Socket A: 0, 1, 2
> - *	Socket B: 3, 4, 5
> - *
> - * When Node 0 fills up, its memory should be migrated to
> - * Node 1.  When Node 1 fills up, it should be migrated to
> - * Node 2.  The migration path start on the nodes with the
> - * processors (since allocations default to this node) and
> - * fast memory, progress through medium and end with the
> - * slow memory:
> - *
> - *	0 -> 1 -> 2 -> stop
> - *	3 -> 4 -> 5 -> stop
> - *
> - * This is represented in the node_demotion[] like this:
> - *
> - *	{  nr=1, nodes[0]=1 }, // Node 0 migrates to 1
> - *	{  nr=1, nodes[0]=2 }, // Node 1 migrates to 2
> - *	{  nr=0, nodes[0]=-1 }, // Node 2 does not migrate
> - *	{  nr=1, nodes[0]=4 }, // Node 3 migrates to 4
> - *	{  nr=1, nodes[0]=5 }, // Node 4 migrates to 5
> - *	{  nr=0, nodes[0]=-1 }, // Node 5 does not migrate
> - *
> - * Moreover some systems may have multiple slow memory nodes.
> - * Suppose a system has one socket with 3 memory nodes, node 0
> - * is fast memory type, and node 1/2 both are slow memory
> - * type, and the distance between fast memory node and slow
> - * memory node is same. So the migration path should be:
> - *
> - *	0 -> 1/2 -> stop
> - *
> - * This is represented in the node_demotion[] like this:
> - *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
> - *	{ nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
> - *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
> - */
> -
> -/*
> - * Writes to this array occur without locking.  Cycles are
> - * not allowed: Node X demotes to Y which demotes to X...
> - *
> - * If multiple reads are performed, a single rcu_read_lock()
> - * must be held over all reads to ensure that no cycles are
> - * observed.
> - */
> -#define DEFAULT_DEMOTION_TARGET_NODES 15
> -
> -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
> -#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
> -#else
> -#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
> -#endif
> -
> -struct demotion_nodes {
> -	unsigned short nr;
> -	short nodes[DEMOTION_TARGET_NODES];
> -};
> -
> -static struct demotion_nodes *node_demotion __read_mostly;
> -
> -/**
> - * next_demotion_node() - Get the next node in the demotion path
> - * @node: The starting node to lookup the next node
> - *
> - * Return: node id for next memory node in the demotion path hierarchy
> - * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
> - * @node online or guarantee that it *continues* to be the next demotion
> - * target.
> - */
> -int next_demotion_node(int node)
> -{
> -	struct demotion_nodes *nd;
> -	unsigned short target_nr, index;
> -	int target;
> -
> -	if (!node_demotion)
> -		return NUMA_NO_NODE;
> -
> -	nd = &node_demotion[node];
> -
> -	/*
> -	 * node_demotion[] is updated without excluding this
> -	 * function from running.  RCU doesn't provide any
> -	 * compiler barriers, so the READ_ONCE() is required
> -	 * to avoid compiler reordering or read merging.
> -	 *
> -	 * Make sure to use RCU over entire code blocks if
> -	 * node_demotion[] reads need to be consistent.
> -	 */
> -	rcu_read_lock();
> -	target_nr = READ_ONCE(nd->nr);
> -
> -	switch (target_nr) {
> -	case 0:
> -		target = NUMA_NO_NODE;
> -		goto out;
> -	case 1:
> -		index = 0;
> -		break;
> -	default:
> -		/*
> -		 * If there are multiple target nodes, just select one
> -		 * target node randomly.
> -		 *
> -		 * In addition, we can also use round-robin to select
> -		 * target node, but we should introduce another variable
> -		 * for node_demotion[] to record last selected target node,
> -		 * that may cause cache ping-pong due to the changing of
> -		 * last target node. Or introducing per-cpu data to avoid
> -		 * caching issue, which seems more complicated. So selecting
> -		 * target node randomly seems better until now.
> -		 */
> -		index = get_random_int() % target_nr;
> -		break;
> -	}
> -
> -	target = READ_ONCE(nd->nodes[index]);
> -
> -out:
> -	rcu_read_unlock();
> -	return target;
> -}
> -
> -/* Disable reclaim-based migration. */
> -static void __disable_all_migrate_targets(void)
> -{
> -	int node, i;
> -
> -	if (!node_demotion)
> -		return;
> -
> -	for_each_online_node(node) {
> -		node_demotion[node].nr = 0;
> -		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
> -			node_demotion[node].nodes[i] = NUMA_NO_NODE;
> -	}
> -}
> -
> -static void disable_all_migrate_targets(void)
> -{
> -	__disable_all_migrate_targets();
> -
> -	/*
> -	 * Ensure that the "disable" is visible across the system.
> -	 * Readers will see either a combination of before+disable
> -	 * state or disable+after.  They will never see before and
> -	 * after state together.
> -	 *
> -	 * The before+after state together might have cycles and
> -	 * could cause readers to do things like loop until this
> -	 * function finishes.  This ensures they can only see a
> -	 * single "bad" read and would, for instance, only loop
> -	 * once.
> -	 */
> -	synchronize_rcu();
> -}
> -
> -/*
> - * Find an automatic demotion target for 'node'.
> - * Failing here is OK.  It might just indicate
> - * being at the end of a chain.
> - */
> -static int establish_migrate_target(int node, nodemask_t *used,
> -				    int best_distance)
> -{
> -	int migration_target, index, val;
> -	struct demotion_nodes *nd;
> -
> -	if (!node_demotion)
> -		return NUMA_NO_NODE;
> -
> -	nd = &node_demotion[node];
> -
> -	migration_target = find_next_best_node(node, used);
> -	if (migration_target == NUMA_NO_NODE)
> -		return NUMA_NO_NODE;
> -
> -	/*
> -	 * If the node has been set a migration target node before,
> -	 * which means it's the best distance between them. Still
> -	 * check if this node can be demoted to other target nodes
> -	 * if they have a same best distance.
> -	 */
> -	if (best_distance != -1) {
> -		val = node_distance(node, migration_target);
> -		if (val > best_distance)
> -			goto out_clear;
> -	}
> -
> -	index = nd->nr;
> -	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
> -		      "Exceeds maximum demotion target nodes\n"))
> -		goto out_clear;
> -
> -	nd->nodes[index] = migration_target;
> -	nd->nr++;
> -
> -	return migration_target;
> -out_clear:
> -	node_clear(migration_target, *used);
> -	return NUMA_NO_NODE;
> -}
> -
> -/*
> - * When memory fills up on a node, memory contents can be
> - * automatically migrated to another node instead of
> - * discarded at reclaim.
> - *
> - * Establish a "migration path" which will start at nodes
> - * with CPUs and will follow the priorities used to build the
> - * page allocator zonelists.
> - *
> - * The difference here is that cycles must be avoided.  If
> - * node0 migrates to node1, then neither node1, nor anything
> - * node1 migrates to can migrate to node0. Also one node can
> - * be migrated to multiple nodes if the target nodes all have
> - * a same best-distance against the source node.
> - *
> - * This function can run simultaneously with readers of
> - * node_demotion[].  However, it can not run simultaneously
> - * with itself.  Exclusion is provided by memory hotplug events
> - * being single-threaded.
> - */
> -static void __set_migration_target_nodes(void)
> -{
> -	nodemask_t next_pass;
> -	nodemask_t this_pass;
> -	nodemask_t used_targets = NODE_MASK_NONE;
> -	int node, best_distance;
> -
> -	/*
> -	 * Avoid any oddities like cycles that could occur
> -	 * from changes in the topology.  This will leave
> -	 * a momentary gap when migration is disabled.
> -	 */
> -	disable_all_migrate_targets();
> -
> -	/*
> -	 * Allocations go close to CPUs, first.  Assume that
> -	 * the migration path starts at the nodes with CPUs.
> -	 */
> -	next_pass = node_states[N_CPU];
> -again:
> -	this_pass = next_pass;
> -	next_pass = NODE_MASK_NONE;
> -	/*
> -	 * To avoid cycles in the migration "graph", ensure
> -	 * that migration sources are not future targets by
> -	 * setting them in 'used_targets'.  Do this only
> -	 * once per pass so that multiple source nodes can
> -	 * share a target node.
> -	 *
> -	 * 'used_targets' will become unavailable in future
> -	 * passes.  This limits some opportunities for
> -	 * multiple source nodes to share a destination.
> -	 */
> -	nodes_or(used_targets, used_targets, this_pass);
> -
> -	for_each_node_mask(node, this_pass) {
> -		best_distance = -1;
> -
> -		/*
> -		 * Try to set up the migration path for the node, and the target
> -		 * migration nodes can be multiple, so doing a loop to find all
> -		 * the target nodes if they all have a best node distance.
> -		 */
> -		do {
> -			int target_node =
> -				establish_migrate_target(node, &used_targets,
> -							 best_distance);
> -
> -			if (target_node == NUMA_NO_NODE)
> -				break;
> -
> -			if (best_distance == -1)
> -				best_distance = node_distance(node, target_node);
> -
> -			/*
> -			 * Visit targets from this pass in the next pass.
> -			 * Eventually, every node will have been part of
> -			 * a pass, and will become set in 'used_targets'.
> -			 */
> -			node_set(target_node, next_pass);
> -		} while (1);
> -	}
> -	/*
> -	 * 'next_pass' contains nodes which became migration
> -	 * targets in this pass.  Make additional passes until
> -	 * no more migrations targets are available.
> -	 */
> -	if (!nodes_empty(next_pass))
> -		goto again;
> -}
> -
> -/*
> - * For callers that do not hold get_online_mems() already.
> - */
> -void set_migration_target_nodes(void)
> -{
> -	get_online_mems();
> -	__set_migration_target_nodes();
> -	put_online_mems();
> -}
> -
> -/*
> - * This leaves migrate-on-reclaim transiently disabled between
> - * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
> - * whether reclaim-based migration is enabled or not, which
> - * ensures that the user can turn reclaim-based migration at
> - * any time without needing to recalculate migration targets.
> - *
> - * These callbacks already hold get_online_mems().  That is why
> - * __set_migration_target_nodes() can be used as opposed to
> - * set_migration_target_nodes().
> - */
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> -						 unsigned long action, void *_arg)
> -{
> -	struct memory_notify *arg = _arg;
> -
> -	/*
> -	 * Only update the node migration order when a node is
> -	 * changing status, like online->offline.  This avoids
> -	 * the overhead of synchronize_rcu() in most cases.
> -	 */
> -	if (arg->status_change_nid < 0)
> -		return notifier_from_errno(0);
> -
> -	switch (action) {
> -	case MEM_GOING_OFFLINE:
> -		/*
> -		 * Make sure there are not transient states where
> -		 * an offline node is a migration target.  This
> -		 * will leave migration disabled until the offline
> -		 * completes and the MEM_OFFLINE case below runs.
> -		 */
> -		disable_all_migrate_targets();
> -		break;
> -	case MEM_OFFLINE:
> -	case MEM_ONLINE:
> -		/*
> -		 * Recalculate the target nodes once the node
> -		 * reaches its final state (online or offline).
> -		 */
> -		__set_migration_target_nodes();
> -		break;
> -	case MEM_CANCEL_OFFLINE:
> -		/*
> -		 * MEM_GOING_OFFLINE disabled all the migration
> -		 * targets.  Reenable them.
> -		 */
> -		__set_migration_target_nodes();
> -		break;
> -	case MEM_GOING_ONLINE:
> -	case MEM_CANCEL_ONLINE:
> -		break;
> -	}
> -
> -	return notifier_from_errno(0);
> -}
> -#endif
> -
> -void __init migrate_on_reclaim_init(void)
> -{
> -	node_demotion = kcalloc(nr_node_ids,
> -				sizeof(struct demotion_nodes),
> -				GFP_KERNEL);
> -	WARN_ON(!node_demotion);
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
> -#endif
> -	/*
> -	 * At this point, all numa nodes with memory/CPus have their state
> -	 * properly set, so we can build the demotion order now.
> -	 * Let us hold the cpu_hotplug lock just, as we could possibily have
> -	 * CPU hotplug events during boot.
> -	 */
> -	cpus_read_lock();
> -	set_migration_target_nodes();
> -	cpus_read_unlock();
> -}
>  #endif /* CONFIG_NUMA */
> -
> -
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 373d2730fcf2..35c6ff97cf29 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -28,7 +28,6 @@
>  #include <linux/mm_inline.h>
>  #include <linux/page_ext.h>
>  #include <linux/page_owner.h>
> -#include <linux/migrate.h>
>  
>  #include "internal.h"
>  
> @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu)
>  
>  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
>  		node_set_state(cpu_to_node(cpu), N_CPU);
> -		set_migration_target_nodes();
>  	}
>  
>  	return 0;
> @@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu)
>  		return 0;
>  
>  	node_clear_state(node, N_CPU);
> -	set_migration_target_nodes();
>  
>  	return 0;
>  }
> @@ -2118,7 +2115,6 @@ void __init init_mm_internals(void)
>  
>  	start_shepherd_timer();
>  #endif
> -	migrate_on_reclaim_init();
>  #ifdef CONFIG_PROC_FS
>  	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
>  	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 8/8] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-07-28 19:04 ` [PATCH v11 8/8] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
@ 2022-07-29  6:39   ` Huang, Ying
  2022-07-29  6:41     ` Aneesh Kumar K V
  0 siblings, 1 reply; 36+ messages in thread
From: Huang, Ying @ 2022-07-29  6:39 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> With memory tiers support we can have memory-only NUMA nodes
> in the top tier, and we want to avoid promotion-tracking NUMA
> faults on them. Update node_is_toptier to work with memory tiers.
> All NUMA nodes are top tier nodes by default. With lower memory
> tiers added, we consider all memory tiers above a memory tier having
> CPU NUMA nodes to be top memory tiers.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h | 11 ++++++++++
>  include/linux/node.h         |  5 -----
>  mm/huge_memory.c             |  1 +
>  mm/memory-tiers.c            | 42 ++++++++++++++++++++++++++++++++++++
>  mm/migrate.c                 |  1 +
>  mm/mprotect.c                |  1 +
>  6 files changed, 56 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index f8dbeda617a7..bc9fb9d39b2c 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -35,6 +35,7 @@ struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> +bool node_is_toptier(int node);
>  #else
>  static inline int next_demotion_node(int node)
>  {
> @@ -45,6 +46,11 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
>  {
>  	*targets = NODE_MASK_NONE;
>  }
> +
> +static inline bool node_is_toptier(int node)
> +{
> +	return true;
> +}
>  #endif
>  
>  #else
> @@ -64,5 +70,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
>  {
>  	*targets = NODE_MASK_NONE;
>  }
> +
> +static inline bool node_is_toptier(int node)
> +{
> +	return true;
> +}
>  #endif	/* CONFIG_NUMA */
>  #endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/include/linux/node.h b/include/linux/node.h
> index 40d641a8bfb0..9ec680dd607f 100644
> --- a/include/linux/node.h
> +++ b/include/linux/node.h
> @@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
>  
>  #define to_node(device) container_of(device, struct node, dev)
>  
> -static inline bool node_is_toptier(int node)
> -{
> -	return node_state(node, N_CPU);
> -}
> -
>  #endif /* _LINUX_NODE_H_ */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 834f288b3769..8405662646e9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -35,6 +35,7 @@
>  #include <linux/numa.h>
>  #include <linux/page_owner.h>
>  #include <linux/sched/sysctl.h>
> +#include <linux/memory-tiers.h>
>  
>  #include <asm/tlb.h>
>  #include <asm/pgalloc.h>
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 84e2be31a853..36d87dc422ab 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -30,6 +30,7 @@ static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
>  struct memory_dev_type *node_memory_types[MAX_NUMNODES];
>  #ifdef CONFIG_MIGRATION
> +static int top_tier_adistance;
>  /*
>   * node_demotion[] examples:
>   *
> @@ -159,6 +160,31 @@ static struct memory_tier *__node_get_memory_tier(int node)
>  }
>  
>  #ifdef CONFIG_MIGRATION
> +bool node_is_toptier(int node)
> +{
> +	bool toptier;
> +	pg_data_t *pgdat;
> +	struct memory_tier *memtier;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return false;
> +
> +	rcu_read_lock();
> +	memtier = rcu_dereference(pgdat->memtier);
> +	if (!memtier) {
> +		toptier = true;
> +		goto out;
> +	}
> +	if (memtier->adistance_start >= top_tier_adistance)
> +		toptier = true;
> +	else
> +		toptier = false;
> +out:
> +	rcu_read_unlock();
> +	return toptier;
> +}
> +
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>  {
>  	struct memory_tier *memtier;
> @@ -315,6 +341,22 @@ static void establish_demotion_targets(void)
>  			}
>  		} while (1);
>  	}
> +	/*
> +	 * Promotion is allowed from a memory tier to a higher
> +	 * memory tier only if the memory tier doesn't include
> +	 * compute. We want to skip promotion from a memory tier
> +	 * if any node that is part of that tier has CPUs.
> +	 * Once we detect such a memory tier, we consider that tier
> +	 * as the top tier from which promotion is not allowed.
> +	 */
> +	list_for_each_entry(memtier, &memory_tiers, list) {
> +		tier_nodes = get_memtier_nodemask(memtier);
> +		nodes_and(tier_nodes, node_states[N_CPU], tier_nodes);
> +		if (!nodes_empty(tier_nodes)) {
> +			top_tier_adistance = memtier->adistance_start;

IMHO, this should be,

			top_tier_adistance = memtier->adistance_start + MEMTIER_CHUNK_SIZE;

Best Regards,
Huang, Ying

> +			break;
> +		}
> +	}
>  	/*
>  	 * Now build the lower_tier mask for each node collecting node mask from
>  	 * all memory tier below it. This allows us to fallback demotion page
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c758c9c21d7d..1da81136eaaa 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -50,6 +50,7 @@
>  #include <linux/memory.h>
>  #include <linux/random.h>
>  #include <linux/sched/sysctl.h>
> +#include <linux/memory-tiers.h>
>  
>  #include <asm/tlbflush.h>
>  
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index ba5592655ee3..92a2fc0fa88b 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -31,6 +31,7 @@
>  #include <linux/pgtable.h>
>  #include <linux/sched/sysctl.h>
>  #include <linux/userfaultfd_k.h>
> +#include <linux/memory-tiers.h>
>  #include <asm/cacheflush.h>
>  #include <asm/mmu_context.h>
>  #include <asm/tlbflush.h>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 8/8] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-07-29  6:39   ` Huang, Ying
@ 2022-07-29  6:41     ` Aneesh Kumar K V
  2022-07-29  6:47       ` Aneesh Kumar K V
  0 siblings, 1 reply; 36+ messages in thread
From: Aneesh Kumar K V @ 2022-07-29  6:41 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

On 7/29/22 12:09 PM, Huang, Ying wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> 
>> With memory tiers support we can have memory-only NUMA nodes
>> in the top tier, and we want to avoid promotion-tracking NUMA
>> faults on them. Update node_is_toptier to work with memory tiers.
>> All NUMA nodes are top tier nodes by default. With lower memory
>> tiers added, we consider all memory tiers above a memory tier having
>> CPU NUMA nodes to be top memory tiers.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  include/linux/memory-tiers.h | 11 ++++++++++
>>  include/linux/node.h         |  5 -----
>>  mm/huge_memory.c             |  1 +
>>  mm/memory-tiers.c            | 42 ++++++++++++++++++++++++++++++++++++
>>  mm/migrate.c                 |  1 +
>>  mm/mprotect.c                |  1 +
>>  6 files changed, 56 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> index f8dbeda617a7..bc9fb9d39b2c 100644
>> --- a/include/linux/memory-tiers.h
>> +++ b/include/linux/memory-tiers.h
>> @@ -35,6 +35,7 @@ struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *
>>  #ifdef CONFIG_MIGRATION
>>  int next_demotion_node(int node);
>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>> +bool node_is_toptier(int node);
>>  #else
>>  static inline int next_demotion_node(int node)
>>  {
>> @@ -45,6 +46,11 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
>>  {
>>  	*targets = NODE_MASK_NONE;
>>  }
>> +
>> +static inline bool node_is_toptier(int node)
>> +{
>> +	return true;
>> +}
>>  #endif
>>  
>>  #else
>> @@ -64,5 +70,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
>>  {
>>  	*targets = NODE_MASK_NONE;
>>  }
>> +
>> +static inline bool node_is_toptier(int node)
>> +{
>> +	return true;
>> +}
>>  #endif	/* CONFIG_NUMA */
>>  #endif  /* _LINUX_MEMORY_TIERS_H */
>> diff --git a/include/linux/node.h b/include/linux/node.h
>> index 40d641a8bfb0..9ec680dd607f 100644
>> --- a/include/linux/node.h
>> +++ b/include/linux/node.h
>> @@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
>>  
>>  #define to_node(device) container_of(device, struct node, dev)
>>  
>> -static inline bool node_is_toptier(int node)
>> -{
>> -	return node_state(node, N_CPU);
>> -}
>> -
>>  #endif /* _LINUX_NODE_H_ */
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 834f288b3769..8405662646e9 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -35,6 +35,7 @@
>>  #include <linux/numa.h>
>>  #include <linux/page_owner.h>
>>  #include <linux/sched/sysctl.h>
>> +#include <linux/memory-tiers.h>
>>  
>>  #include <asm/tlb.h>
>>  #include <asm/pgalloc.h>
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> index 84e2be31a853..36d87dc422ab 100644
>> --- a/mm/memory-tiers.c
>> +++ b/mm/memory-tiers.c
>> @@ -30,6 +30,7 @@ static DEFINE_MUTEX(memory_tier_lock);
>>  static LIST_HEAD(memory_tiers);
>>  struct memory_dev_type *node_memory_types[MAX_NUMNODES];
>>  #ifdef CONFIG_MIGRATION
>> +static int top_tier_adistance;
>>  /*
>>   * node_demotion[] examples:
>>   *
>> @@ -159,6 +160,31 @@ static struct memory_tier *__node_get_memory_tier(int node)
>>  }
>>  
>>  #ifdef CONFIG_MIGRATION
>> +bool node_is_toptier(int node)
>> +{
>> +	bool toptier;
>> +	pg_data_t *pgdat;
>> +	struct memory_tier *memtier;
>> +
>> +	pgdat = NODE_DATA(node);
>> +	if (!pgdat)
>> +		return false;
>> +
>> +	rcu_read_lock();
>> +	memtier = rcu_dereference(pgdat->memtier);
>> +	if (!memtier) {
>> +		toptier = true;
>> +		goto out;
>> +	}
>> +	if (memtier->adistance_start >= top_tier_adistance)
>> +		toptier = true;
>> +	else
>> +		toptier = false;
>> +out:
>> +	rcu_read_unlock();
>> +	return toptier;
>> +}
>> +
>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>  {
>>  	struct memory_tier *memtier;
>> @@ -315,6 +341,22 @@ static void establish_demotion_targets(void)
>>  			}
>>  		} while (1);
>>  	}
>> +	/*
>> +	 * Promotion is allowed from a memory tier to a higher
>> +	 * memory tier only if the memory tier doesn't include
>> +	 * compute. We want to skip promotion from a memory tier
>> +	 * if any node that is part of that tier has CPUs.
>> +	 * Once we detect such a memory tier, we consider that tier
>> +	 * as the top tier from which promotion is not allowed.
>> +	 */
>> +	list_for_each_entry(memtier, &memory_tiers, list) {
>> +		tier_nodes = get_memtier_nodemask(memtier);
>> +		nodes_and(tier_nodes, node_states[N_CPU], tier_nodes);
>> +		if (!nodes_empty(tier_nodes)) {
>> +			top_tier_adistance = memtier->adistance_start;
> 
> IMHO, this should be,
> 
> 			top_tier_adistance = memtier->adistance_start + MEMTIER_CHUNK_SIZE;
> 

Good catch. Will update. BTW I already sent the v12 version of the patchset to the list.

-aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 8/8] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-07-29  6:41     ` Aneesh Kumar K V
@ 2022-07-29  6:47       ` Aneesh Kumar K V
  2022-08-01  1:04         ` Huang, Ying
  0 siblings, 1 reply; 36+ messages in thread
From: Aneesh Kumar K V @ 2022-07-29  6:47 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

On 7/29/22 12:11 PM, Aneesh Kumar K V wrote:
> On 7/29/22 12:09 PM, Huang, Ying wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>
>>> With memory tiers support we can have memory-only NUMA nodes
>>> in the top tier, and we want to avoid promotion-tracking NUMA
>>> faults on them. Update node_is_toptier to work with memory tiers.
>>> All NUMA nodes are top tier nodes by default. With lower memory
>>> tiers added, we consider all memory tiers above a memory tier having
>>> CPU NUMA nodes to be top memory tiers.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>>  include/linux/memory-tiers.h | 11 ++++++++++
>>>  include/linux/node.h         |  5 -----
>>>  mm/huge_memory.c             |  1 +
>>>  mm/memory-tiers.c            | 42 ++++++++++++++++++++++++++++++++++++
>>>  mm/migrate.c                 |  1 +
>>>  mm/mprotect.c                |  1 +
>>>  6 files changed, 56 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>> index f8dbeda617a7..bc9fb9d39b2c 100644
>>> --- a/include/linux/memory-tiers.h
>>> +++ b/include/linux/memory-tiers.h
>>> @@ -35,6 +35,7 @@ struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *
>>>  #ifdef CONFIG_MIGRATION
>>>  int next_demotion_node(int node);
>>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>>> +bool node_is_toptier(int node);
>>>  #else
>>>  static inline int next_demotion_node(int node)
>>>  {
>>> @@ -45,6 +46,11 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
>>>  {
>>>  	*targets = NODE_MASK_NONE;
>>>  }
>>> +
>>> +static inline bool node_is_toptier(int node)
>>> +{
>>> +	return true;
>>> +}
>>>  #endif
>>>  
>>>  #else
>>> @@ -64,5 +70,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
>>>  {
>>>  	*targets = NODE_MASK_NONE;
>>>  }
>>> +
>>> +static inline bool node_is_toptier(int node)
>>> +{
>>> +	return true;
>>> +}
>>>  #endif	/* CONFIG_NUMA */
>>>  #endif  /* _LINUX_MEMORY_TIERS_H */
>>> diff --git a/include/linux/node.h b/include/linux/node.h
>>> index 40d641a8bfb0..9ec680dd607f 100644
>>> --- a/include/linux/node.h
>>> +++ b/include/linux/node.h
>>> @@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
>>>  
>>>  #define to_node(device) container_of(device, struct node, dev)
>>>  
>>> -static inline bool node_is_toptier(int node)
>>> -{
>>> -	return node_state(node, N_CPU);
>>> -}
>>> -
>>>  #endif /* _LINUX_NODE_H_ */
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 834f288b3769..8405662646e9 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -35,6 +35,7 @@
>>>  #include <linux/numa.h>
>>>  #include <linux/page_owner.h>
>>>  #include <linux/sched/sysctl.h>
>>> +#include <linux/memory-tiers.h>
>>>  
>>>  #include <asm/tlb.h>
>>>  #include <asm/pgalloc.h>
>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>> index 84e2be31a853..36d87dc422ab 100644
>>> --- a/mm/memory-tiers.c
>>> +++ b/mm/memory-tiers.c
>>> @@ -30,6 +30,7 @@ static DEFINE_MUTEX(memory_tier_lock);
>>>  static LIST_HEAD(memory_tiers);
>>>  struct memory_dev_type *node_memory_types[MAX_NUMNODES];
>>>  #ifdef CONFIG_MIGRATION
>>> +static int top_tier_adistance;
>>>  /*
>>>   * node_demotion[] examples:
>>>   *
>>> @@ -159,6 +160,31 @@ static struct memory_tier *__node_get_memory_tier(int node)
>>>  }
>>>  
>>>  #ifdef CONFIG_MIGRATION
>>> +bool node_is_toptier(int node)
>>> +{
>>> +	bool toptier;
>>> +	pg_data_t *pgdat;
>>> +	struct memory_tier *memtier;
>>> +
>>> +	pgdat = NODE_DATA(node);
>>> +	if (!pgdat)
>>> +		return false;
>>> +
>>> +	rcu_read_lock();
>>> +	memtier = rcu_dereference(pgdat->memtier);
>>> +	if (!memtier) {
>>> +		toptier = true;
>>> +		goto out;
>>> +	}
>>> +	if (memtier->adistance_start >= top_tier_adistance)
>>> +		toptier = true;
>>> +	else
>>> +		toptier = false;
>>> +out:
>>> +	rcu_read_unlock();
>>> +	return toptier;
>>> +}
>>> +
>>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>  {
>>>  	struct memory_tier *memtier;
>>> @@ -315,6 +341,22 @@ static void establish_demotion_targets(void)
>>>  			}
>>>  		} while (1);
>>>  	}
>>> +	/*
>>> +	 * Promotion is allowed from a memory tier to a higher
>>> +	 * memory tier only if the memory tier doesn't include
>>> +	 * compute. We want to skip promotion from a memory tier
>>> +	 * if any node that is part of that tier has CPUs.
>>> +	 * Once we detect such a memory tier, we consider that tier
>>> +	 * as the top tier from which promotion is not allowed.
>>> +	 */
>>> +	list_for_each_entry(memtier, &memory_tiers, list) {
>>> +		tier_nodes = get_memtier_nodemask(memtier);
>>> +		nodes_and(tier_nodes, node_states[N_CPU], tier_nodes);
>>> +		if (!nodes_empty(tier_nodes)) {
>>> +			top_tier_adistance = memtier->adistance_start;
>>
>> IMHO, this should be,
>>
>> 			top_tier_adistance = memtier->adistance_start + MEMTIER_CHUNK_SIZE;
>>
> 
> Good catch. Will update. BTW I already sent the v12 version of the patchset to the list.
> 
>

Checking this again, we consider a node top tier if the node's memtier abstract distance
satisfies the check below.

	if (memtier->adistance_start <= top_tier_adistance)
		toptier = true;
	
With that we should be good with the current code. But I agree with you that top_tier_adistance
should cover the full range of the top memory tier.
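
To make the boundary concrete, here is a small worked comparison with
assumed values (MEMTIER_CHUNK_SIZE == 128, CPU tier covering abstract
distance [512, 640)):

	/*
	 * v11 check:   top_tier_adistance = 512;
	 *              toptier if memtier->adistance_start <= 512
	 * suggested:   top_tier_adistance = 512 + 128 = 640;
	 *              toptier if memtier->adistance_start + 128 <= 640
	 *
	 * Both forms mark the CPU tier and every faster tier (smaller
	 * adistance) as top tier; the second makes it explicit that the
	 * whole [512, 640) range of the CPU tier is covered.
	 */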

-aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
  2022-07-29  6:20   ` Huang, Ying
@ 2022-07-29  7:19     ` Aneesh Kumar K.V
  2022-08-01  2:06       ` Huang, Ying
  0 siblings, 1 reply; 36+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-29  7:19 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Huang, Ying" <ying.huang@intel.com> writes:

> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
>> By default, all nodes are assigned to the default memory tier, which
>> is the memory tier designated for nodes with DRAM.
>>
>> Set the dax kmem device node's tier to the slower memory tier by assigning
>> an abstract distance of MEMTIER_ADISTANCE_PMEM. The PMEM tier
>> appears below the default memory tier in demotion order.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  drivers/dax/kmem.c           |  9 +++++++++
>>  include/linux/memory-tiers.h | 19 ++++++++++++++++++-
>>  mm/memory-tiers.c            | 28 ++++++++++++++++------------
>>  3 files changed, 43 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index a37622060fff..6b0d5de9a3e9 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/fs.h>
>>  #include <linux/mm.h>
>>  #include <linux/mman.h>
>> +#include <linux/memory-tiers.h>
>>  #include "dax-private.h"
>>  #include "bus.h"
>>  
>> @@ -41,6 +42,12 @@ struct dax_kmem_data {
>>  	struct resource *res[];
>>  };
>>  
>> +static struct memory_dev_type default_pmem_type  = {
>
> Why is this named default_pmem_type?  We usually will not change the
> memory type of a node.
>

Any other suggestion? pmem_dev_type? 


>> +	.adistance = MEMTIER_ADISTANCE_PMEM,
>> +	.tier_sibiling = LIST_HEAD_INIT(default_pmem_type.tier_sibiling),
>> +	.nodes  = NODE_MASK_NONE,
>> +};
>> +
>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  {
>>  	struct device *dev = &dev_dax->dev;
>> @@ -62,6 +69,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  		return -EINVAL;
>>  	}
>>  
>> +	init_node_memory_type(numa_node, &default_pmem_type);
>> +
>
> The memory hot-add below may fail.  So the error handling needs to be
> added.
>
> And, it appears that the memory type and memory tier of a node may be
> fully initialized here before NUMA hot-adding starts.  So I suggest
> setting node_memory_types[] here only, and setting memory_dev_type->nodes
> in the node hot-add callback.  I think that is the proper place to
> complete the initialization.
>
> And, in theory dax/kmem.c can be unloaded.  So we need to clear
> node_memory_types[] for nodes somewhere.
>

I guess by module exit we can be sure that all the memory managed
by dax/kmem is hotplugged out. How about something like below?

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 6b0d5de9a3e9..eb4e158012a9 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -248,6 +248,7 @@ static void __exit dax_kmem_exit(void)
 	dax_driver_unregister(&device_dax_kmem_driver);
 	if (!any_hotremove_failed)
 		kfree_const(kmem_name);
+	unregister_memory_type(&default_pmem_type);
 }
 
 MODULE_AUTHOR("Intel Corporation");
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index fc6b7a14da51..8355baf5b8b4 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -31,6 +31,7 @@ struct memory_dev_type {
 #ifdef CONFIG_NUMA
 extern bool numa_demotion_enabled;
 void init_node_memory_type(int node, struct memory_dev_type *default_type);
+void unregister_memory_type(struct memory_dev_type *memtype);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -57,6 +58,10 @@ static inline bool node_is_toptier(int node)
 #define numa_demotion_enabled	false
 static inline void init_node_memory_type(int node, struct memory_dev_type *default_type)
 {
+}
+
+static inline void unregister_memory_type(struct memory_dev_type *memtype)
+{
 
 }
 
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 064e0f932795..4d29ebd4c4f3 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -500,6 +500,28 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type)
 	mutex_unlock(&memory_tier_lock);
 }
 
+void unregister_memory_type(struct memory_dev_type *memtype)
+{
+	int node;
+	struct memory_tier *memtier = memtype->memtier;
+
+	mutex_lock(&memory_tier_lock);
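+	/* Disassociate every node still pointing at this memory type. */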
+	for (node = 0; node < MAX_NUMNODES; node++) {
+		if (node_memory_types[node] == memtype) {
+			WARN_ON(!nodes_empty(memtype->nodes));
+			node_memory_types[node] = NULL;
+		}
+	}
+
+	list_del(&memtype->tier_sibiling);
+	memtype->memtier = NULL;
+	if (list_empty(&memtier->memory_types))
+		destroy_memory_tier(memtier);
+
+	mutex_unlock(&memory_tier_lock);
+}
+
 void update_node_adistance(int node, struct memory_dev_type *memtype)
 {
 	pg_data_t *pgdat;

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-29  6:35   ` Huang, Ying
@ 2022-07-29  7:22     ` Aneesh Kumar K.V
  2022-08-01  2:15       ` Huang, Ying
  0 siblings, 1 reply; 36+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-29  7:22 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Huang, Ying" <ying.huang@intel.com> writes:

> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> + */

....

>> +int next_demotion_node(int node)
>> +{
>> +	struct demotion_nodes *nd;
>> +	int target;
>> +
>> +	if (!node_demotion)
>> +		return NUMA_NO_NODE;
>> +
>> +	nd = &node_demotion[node];
>> +
>> +	/*
>> +	 * node_demotion[] is updated without excluding this
>> +	 * function from running.
>> +	 *
>> +	 * Make sure to use RCU over entire code blocks if
>> +	 * node_demotion[] reads need to be consistent.
>> +	 */
>> +	rcu_read_lock();
>> +	/*
>> +	 * If there are multiple target nodes, just select one
>> +	 * target node randomly.
>> +	 *
>> +	 * We could instead use round-robin to select the target
>> +	 * node, but that would require another field in
>> +	 * node_demotion[] to record the last selected target,
>> +	 * which may cause cache ping-pong as it changes. Per-cpu
>> +	 * data would avoid that caching issue but seems more
>> +	 * complicated. So selecting the target node randomly
>> +	 * seems better for now.
>> +	 */
>> +	target = node_random(&nd->preferred);
>
> I don't find code to optimize node_random() for the weight == 1 case; did
> you forget to do that?

I guess you suggested doing that as a separate patch for node_random, or did
I get the review feedback wrong?

https://lore.kernel.org/linux-mm/87y1wdn30p.fsf@yhuang6-desk2.ccr.corp.intel.com

The change for node_random will be a patch outside this series.

-aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-29  6:25   ` Huang, Ying
@ 2022-07-29  7:24     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 36+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-29  7:24 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

"Huang, Ying" <ying.huang@intel.com> writes:

> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
>> In the current kernel, memory tiers are defined implicitly via a demotion path
>> relationship between NUMA nodes, which is created during the kernel
>> initialization and updated when a NUMA node is hot-added or hot-removed. The
>> current implementation puts all nodes with CPU into the highest tier, and builds
>> the tier hierarchy tier-by-tier by establishing the per-node demotion targets
>> based on the distances between nodes.
>>
>> This current memory tier kernel implementation needs to be improved for several
>> important use cases,
>>
>> The current tier initialization code always initializes each memory-only NUMA
>> node into a lower tier. But a memory-only NUMA node may have a high performance
>> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
>> should be put into a higher tier.
>>
>> The current tier hierarchy always puts CPU nodes into the top tier. But on a
>> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices
>> should be in the top tier, and DRAM nodes with CPUs are better to be placed into
>> the next lower tier.
>>
>> With the current kernel, a higher tier node can only be demoted to nodes with the
>> shortest distance on the next lower tier as defined by the demotion path, not any
>> other node from any lower tier. This strict demotion order does not work in all
>> use cases (e.g. some use cases may want to allow cross-socket demotion to another
>> node in the same demotion tier as a fallback when the preferred demotion node is
>> out of space). This demotion order is also inconsistent with the page allocation
>> fallback order when all the nodes in a higher tier are out of space: the page
>> allocation can fall back to any node from any lower tier, whereas the demotion
>> order doesn't allow that.
>>
>> This patch series addresses the above by defining memory tiers explicitly.
>>
>> Linux kernel presents memory devices as NUMA nodes and each memory device is of
>> a specific type. The memory type of a device is represented by its abstract
>> distance. A memory tier corresponds to a range of abstract distance. This allows
>> for classifying memory devices with a specific performance range into a memory
>> tier.
>>
>> This patch configures the range/chunk size to be 128. The default DRAM
>> abstract distance is 512. We can have 4 memory tiers below the default DRAM
>> abstract distance, which cover the ranges 0 - 127, 128 - 255, 256 - 383, and 384 - 511.
>> Slower memory devices like persistent memory will have abstract distance below
>> the default DRAM level and hence will be placed in these 4 lower tiers.
>
> For abstract distance, the lower value means higher performance, higher
> value means lower performance.  So the abstract distance of PMEM should
> be larger than that of DRAM.

I noticed that after sending v11 and already sent v12 fixing it, which
can be found at

https://lore.kernel.org/linux-mm/20220729061349.968148-1-aneesh.kumar@linux.ibm.com
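
As a side note on how the chunking works: a device's abstract distance
is rounded down to a MEMTIER_CHUNK_SIZE boundary to pick its tier, so
with a chunk size of 128 the bucketing is roughly (an illustrative
sketch of what find_create_memory_tier() does, not the literal code):

static inline int memtier_adistance_start(int adistance)
{
	/* Tier start covering this abstract distance, e.g. 300 -> 256. */
	return round_down(adistance, MEMTIER_CHUNK_SIZE);
}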

>
>> A kernel parameter is provided to override the default memory tier.
>
> Did you forget to delete this?

yes. Also fixed in v12.

-aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 8/8] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-07-29  6:47       ` Aneesh Kumar K V
@ 2022-08-01  1:04         ` Huang, Ying
  0 siblings, 0 replies; 36+ messages in thread
From: Huang, Ying @ 2022-08-01  1:04 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 7/29/22 12:11 PM, Aneesh Kumar K V wrote:
>> On 7/29/22 12:09 PM, Huang, Ying wrote:
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>
>>>> With memory tiers support we can have memory-only NUMA nodes
>>>> in the top tier, and we want to avoid promotion-tracking NUMA
>>>> faults on them. Update node_is_toptier to work with memory tiers.
>>>> All NUMA nodes are top tier nodes by default. With lower memory
>>>> tiers added, we consider all memory tiers above a memory tier having
>>>> CPU NUMA nodes to be top memory tiers.
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>> ---
>>>>  include/linux/memory-tiers.h | 11 ++++++++++
>>>>  include/linux/node.h         |  5 -----
>>>>  mm/huge_memory.c             |  1 +
>>>>  mm/memory-tiers.c            | 42 ++++++++++++++++++++++++++++++++++++
>>>>  mm/migrate.c                 |  1 +
>>>>  mm/mprotect.c                |  1 +
>>>>  6 files changed, 56 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>>> index f8dbeda617a7..bc9fb9d39b2c 100644
>>>> --- a/include/linux/memory-tiers.h
>>>> +++ b/include/linux/memory-tiers.h
>>>> @@ -35,6 +35,7 @@ struct memory_dev_type *init_node_memory_type(int node, struct memory_dev_type *
>>>>  #ifdef CONFIG_MIGRATION
>>>>  int next_demotion_node(int node);
>>>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
>>>> +bool node_is_toptier(int node);
>>>>  #else
>>>>  static inline int next_demotion_node(int node)
>>>>  {
>>>> @@ -45,6 +46,11 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
>>>>  {
>>>>  	*targets = NODE_MASK_NONE;
>>>>  }
>>>> +
>>>> +static inline bool node_is_toptier(int node)
>>>> +{
>>>> +	return true;
>>>> +}
>>>>  #endif
>>>>  
>>>>  #else
>>>> @@ -64,5 +70,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
>>>>  {
>>>>  	*targets = NODE_MASK_NONE;
>>>>  }
>>>> +
>>>> +static inline bool node_is_toptier(int node)
>>>> +{
>>>> +	return true;
>>>> +}
>>>>  #endif	/* CONFIG_NUMA */
>>>>  #endif  /* _LINUX_MEMORY_TIERS_H */
>>>> diff --git a/include/linux/node.h b/include/linux/node.h
>>>> index 40d641a8bfb0..9ec680dd607f 100644
>>>> --- a/include/linux/node.h
>>>> +++ b/include/linux/node.h
>>>> @@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
>>>>  
>>>>  #define to_node(device) container_of(device, struct node, dev)
>>>>  
>>>> -static inline bool node_is_toptier(int node)
>>>> -{
>>>> -	return node_state(node, N_CPU);
>>>> -}
>>>> -
>>>>  #endif /* _LINUX_NODE_H_ */
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 834f288b3769..8405662646e9 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -35,6 +35,7 @@
>>>>  #include <linux/numa.h>
>>>>  #include <linux/page_owner.h>
>>>>  #include <linux/sched/sysctl.h>
>>>> +#include <linux/memory-tiers.h>
>>>>  
>>>>  #include <asm/tlb.h>
>>>>  #include <asm/pgalloc.h>
>>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>>> index 84e2be31a853..36d87dc422ab 100644
>>>> --- a/mm/memory-tiers.c
>>>> +++ b/mm/memory-tiers.c
>>>> @@ -30,6 +30,7 @@ static DEFINE_MUTEX(memory_tier_lock);
>>>>  static LIST_HEAD(memory_tiers);
>>>>  struct memory_dev_type *node_memory_types[MAX_NUMNODES];
>>>>  #ifdef CONFIG_MIGRATION
>>>> +static int top_tier_adistance;
>>>>  /*
>>>>   * node_demotion[] examples:
>>>>   *
>>>> @@ -159,6 +160,31 @@ static struct memory_tier *__node_get_memory_tier(int node)
>>>>  }
>>>>  
>>>>  #ifdef CONFIG_MIGRATION
>>>> +bool node_is_toptier(int node)
>>>> +{
>>>> +	bool toptier;
>>>> +	pg_data_t *pgdat;
>>>> +	struct memory_tier *memtier;
>>>> +
>>>> +	pgdat = NODE_DATA(node);
>>>> +	if (!pgdat)
>>>> +		return false;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	memtier = rcu_dereference(pgdat->memtier);
>>>> +	if (!memtier) {
>>>> +		toptier = true;
>>>> +		goto out;
>>>> +	}
>>>> +	if (memtier->adistance_start >= top_tier_adistance)
>>>> +		toptier = true;
>>>> +	else
>>>> +		toptier = false;
>>>> +out:
>>>> +	rcu_read_unlock();
>>>> +	return toptier;
>>>> +}
>>>> +
>>>>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
>>>>  {
>>>>  	struct memory_tier *memtier;
>>>> @@ -315,6 +341,22 @@ static void establish_demotion_targets(void)
>>>>  			}
>>>>  		} while (1);
>>>>  	}
>>>> +	/*
>>>> +	 * Promotion is allowed from a memory tier to a higher
>>>> +	 * memory tier only if the memory tier doesn't include
>>>> +	 * compute. We want to skip promotion from a memory tier
>>>> +	 * if any node that is part of that tier has CPUs.
>>>> +	 * Once we detect such a memory tier, we consider that tier
>>>> +	 * as the top tier from which promotion is not allowed.
>>>> +	 */
>>>> +	list_for_each_entry(memtier, &memory_tiers, list) {
>>>> +		tier_nodes = get_memtier_nodemask(memtier);
>>>> +		nodes_and(tier_nodes, node_states[N_CPU], tier_nodes);
>>>> +		if (!nodes_empty(tier_nodes)) {
>>>> +			top_tier_adistance = memtier->adistance_start;
>>>
>>> IMHO, this should be,
>>>
>>> 			top_tier_adistance = memtier->adistance_start + MEMTIER_CHUNK_SIZE;
>>>
>> 
>> Good catch. Will update. BTW I already sent the v12 version of the patchset to the list.
>> 
>>
>
> Checking this again, we consider a node top tier if the node's memtier abstract distance
> satisfies the check below.
>
> 	if (memtier->adistance_start <= top_tier_adistance)
> 		toptier = true;

I admit that this works correctly.  But I think the following code is
conceptually clearer, so why not help the code reader understand it
more easily?

        if (memtier->adistance_start + MEMTIER_CHUNK_SIZE <= top_tier_adistance)
                toptier = true;

Best Regards,
Huang, Ying

> With that we should be good with the current code. But I agree with you that top_tier_adistance
> should cover the full range of the top memory tier.
>
> -aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
  2022-07-29  7:19     ` Aneesh Kumar K.V
@ 2022-08-01  2:06       ` Huang, Ying
  2022-08-01  4:40         ` Aneesh Kumar K V
  0 siblings, 1 reply; 36+ messages in thread
From: Huang, Ying @ 2022-08-01  2:06 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>
>>> By default, all nodes are assigned to the default memory tier which
>>> is the memory tier designated for nodes with DRAM
>>>
>>> Set dax kmem device node's tier to slower memory tier by assigning
>>> abstract distance to MEMTIER_ADISTANCE_PMEM. PMEM tier
>>> appears below the default memory tier in demotion order.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>>  drivers/dax/kmem.c           |  9 +++++++++
>>>  include/linux/memory-tiers.h | 19 ++++++++++++++++++-
>>>  mm/memory-tiers.c            | 28 ++++++++++++++++------------
>>>  3 files changed, 43 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>> index a37622060fff..6b0d5de9a3e9 100644
>>> --- a/drivers/dax/kmem.c
>>> +++ b/drivers/dax/kmem.c
>>> @@ -11,6 +11,7 @@
>>>  #include <linux/fs.h>
>>>  #include <linux/mm.h>
>>>  #include <linux/mman.h>
>>> +#include <linux/memory-tiers.h>
>>>  #include "dax-private.h"
>>>  #include "bus.h"
>>>  
>>> @@ -41,6 +42,12 @@ struct dax_kmem_data {
>>>  	struct resource *res[];
>>>  };
>>>  
>>> +static struct memory_dev_type default_pmem_type  = {
>>
>> Why is this named default_pmem_type?  We usually will not change the
>> memory type of a node.
>>
>
> Any other suggestion? pmem_dev_type? 

Or dax_pmem_type?

DAX is used to enumerate the memory device.
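
That is, something like this (the same initializer, just with the
suggested name):

static struct memory_dev_type dax_pmem_type = {
	.adistance = MEMTIER_ADISTANCE_PMEM,
	.tier_sibiling = LIST_HEAD_INIT(dax_pmem_type.tier_sibiling),
	.nodes = NODE_MASK_NONE,
};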

>
>>> +	.adistance = MEMTIER_ADISTANCE_PMEM,
>>> +	.tier_sibiling = LIST_HEAD_INIT(default_pmem_type.tier_sibiling),
>>> +	.nodes  = NODE_MASK_NONE,
>>> +};
>>> +
>>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>  {
>>>  	struct device *dev = &dev_dax->dev;
>>> @@ -62,6 +69,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>  		return -EINVAL;
>>>  	}
>>>  
>>> +	init_node_memory_type(numa_node, &default_pmem_type);
>>> +
>>
>> The memory hot-add below may fail.  So the error handling needs to be
>> added.
>>
>> And, it appears that the memory type and memory tier of a node may be
>> fully initialized here before NUMA hot-adding starts.  So I suggest
>> setting node_memory_types[] here only, and setting memory_dev_type->nodes
>> in the node hot-add callback.  I think that is the proper place to
>> complete the initialization.
>>
>> And, in theory dax/kmem.c can be unloaded.  So we need to clear
>> node_memory_types[] for nodes somewhere.
>>
>
> I guess by module exit we can be sure that all the memory managed
> by dax/kmem is hotplugged out. How about something like below?

Because we set node_memory_types[] in dev_dax_kmem_probe(), it's
natural to clear it in dev_dax_kmem_remove().

Best Regards,
Huang, Ying

> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 6b0d5de9a3e9..eb4e158012a9 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -248,6 +248,7 @@ static void __exit dax_kmem_exit(void)
>  	dax_driver_unregister(&device_dax_kmem_driver);
>  	if (!any_hotremove_failed)
>  		kfree_const(kmem_name);
> +	unregister_memory_type(&default_pmem_type);
>  }
>  
>  MODULE_AUTHOR("Intel Corporation");
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index fc6b7a14da51..8355baf5b8b4 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -31,6 +31,7 @@ struct memory_dev_type {
>  #ifdef CONFIG_NUMA
>  extern bool numa_demotion_enabled;
>  void init_node_memory_type(int node, struct memory_dev_type *default_type);
> +void unregister_memory_type(struct memory_dev_type *memtype);
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> @@ -57,6 +58,10 @@ static inline bool node_is_toptier(int node)
>  #define numa_demotion_enabled	false
>  static inline void init_node_memory_type(int node, struct memory_dev_type *default_type)
>  {
> +}
> +
> +static inline void unregister_memory_type(struct memory_dev_type *memtype)
> +{
>  
>  }
>  
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 064e0f932795..4d29ebd4c4f3 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -500,6 +500,28 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type)
>  	mutex_unlock(&memory_tier_lock);
>  }
>  
> +void unregister_memory_type(struct memory_dev_type *memtype)
> +{
> +	int node;
> +	struct memory_tier *memtier = memtype->memtier;
> +
> +	mutex_lock(&memory_tier_lock);
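> +	/* Disassociate every node still pointing at this memory type. */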
> +	for (node = 0; node < MAX_NUMNODES; node++) {
> +		if (node_memory_types[node] == memtype) {
> +			WARN_ON(!nodes_empty(memtype->nodes));
> +			node_memory_types[node] = NULL;
> +		}
> +	}
> +
> +	list_del(&memtype->tier_sibiling);
> +	memtype->memtier = NULL;
> +	if (list_empty(&memtier->memory_types))
> +		destroy_memory_tier(memtier);
> +
> +	mutex_unlock(&memory_tier_lock);
> +}
> +
>  void update_node_adistance(int node, struct memory_dev_type *memtype)
>  {
>  	pg_data_t *pgdat;


* Re: [PATCH v11 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-29  7:22     ` Aneesh Kumar K.V
@ 2022-08-01  2:15       ` Huang, Ying
  0 siblings, 0 replies; 36+ messages in thread
From: Huang, Ying @ 2022-08-01  2:15 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> + */
>
> ....
>
>>> +int next_demotion_node(int node)
>>> +{
>>> +	struct demotion_nodes *nd;
>>> +	int target;
>>> +
>>> +	if (!node_demotion)
>>> +		return NUMA_NO_NODE;
>>> +
>>> +	nd = &node_demotion[node];
>>> +
>>> +	/*
>>> +	 * node_demotion[] is updated without excluding this
>>> +	 * function from running.
>>> +	 *
>>> +	 * Make sure to use RCU over entire code blocks if
>>> +	 * node_demotion[] reads need to be consistent.
>>> +	 */
>>> +	rcu_read_lock();
>>> +	/*
>>> +	 * If there are multiple target nodes, just select one
>>> +	 * target node randomly.
>>> +	 *
>>> +	 * In addition, we can also use round-robin to select
>>> +	 * target node, but we should introduce another variable
>>> +	 * for node_demotion[] to record last selected target node,
>>> +	 * that may cause cache ping-pong due to the changing of
>>> +	 * last target node. Or introducing per-cpu data to avoid
>>> +	 * caching issue, which seems more complicated. So selecting
>>> +	 * target node randomly seems better until now.
>>> +	 */
>>> +	target = node_random(&nd->preferred);
>>
>> I don't see code optimizing node_random() for the weight == 1 case; did
>> you forget to do that?
>
> I guess you suggested doing that as a separate patch for node_random(),
> or did I get the review feedback wrong?

Yes.

> https://lore.kernel.org/linux-mm/87y1wdn30p.fsf@yhuang6-desk2.ccr.corp.intel.com
>
> The change for node_random will be a patch outside this series.

I think we can include it in this series, because the series provides
more context on why we need the change.
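
Something along these lines (an untested sketch against the node_random()
of the time, not the actual patch) would cover it:

/*
 * Sketch only: add a weight == 1 fast path to node_random() in
 * include/linux/nodemask.h so the common single-target case skips
 * the random number generation entirely.
 */
static inline int node_random(const nodemask_t *maskp)
{
	int w, bit = NUMA_NO_NODE;

	w = nodes_weight(*maskp);
	switch (w) {
	case 0:
		break;				/* empty mask: no node */
	case 1:
		bit = first_node(*maskp);	/* fast path: skip the RNG */
		break;
	default:
		bit = bitmap_ord_to_pos(maskp->bits,
					get_random_int() % w, MAX_NUMNODES);
		break;
	}
	return bit;
}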

Best Regards,
Huang, Ying



* Re: [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
  2022-08-01  2:06       ` Huang, Ying
@ 2022-08-01  4:40         ` Aneesh Kumar K V
  2022-08-01  5:10           ` Huang, Ying
  0 siblings, 1 reply; 36+ messages in thread
From: Aneesh Kumar K V @ 2022-08-01  4:40 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

On 8/1/22 7:36 AM, Huang, Ying wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> 
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>
>>>> By default, all nodes are assigned to the default memory tier which
>>>> is the memory tier designated for nodes with DRAM
>>>>
>>>> Set dax kmem device node's tier to slower memory tier by assigning
>>>> abstract distance to MEMTIER_ADISTANCE_PMEM. PMEM tier
>>>> appears below the default memory tier in demotion order.
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>> ---
>>>>  drivers/dax/kmem.c           |  9 +++++++++
>>>>  include/linux/memory-tiers.h | 19 ++++++++++++++++++-
>>>>  mm/memory-tiers.c            | 28 ++++++++++++++++------------
>>>>  3 files changed, 43 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>> index a37622060fff..6b0d5de9a3e9 100644
>>>> --- a/drivers/dax/kmem.c
>>>> +++ b/drivers/dax/kmem.c
>>>> @@ -11,6 +11,7 @@
>>>>  #include <linux/fs.h>
>>>>  #include <linux/mm.h>
>>>>  #include <linux/mman.h>
>>>> +#include <linux/memory-tiers.h>
>>>>  #include "dax-private.h"
>>>>  #include "bus.h"
>>>>  
>>>> @@ -41,6 +42,12 @@ struct dax_kmem_data {
>>>>  	struct resource *res[];
>>>>  };
>>>>  
>>>> +static struct memory_dev_type default_pmem_type  = {
>>>
>>> Why is this named default_pmem_type?  We will not usually change the
>>> memory type of a node.
>>>
>>
>> Any other suggestion? pmem_dev_type? 
> 
> Or dax_pmem_type?
> 
> DAX is used to enumerate the memory device.
> 
>>
>>>> +	.adistance = MEMTIER_ADISTANCE_PMEM,
>>>> +	.tier_sibiling = LIST_HEAD_INIT(default_pmem_type.tier_sibiling),
>>>> +	.nodes  = NODE_MASK_NONE,
>>>> +};
>>>> +
>>>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>  {
>>>>  	struct device *dev = &dev_dax->dev;
>>>> @@ -62,6 +69,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>  		return -EINVAL;
>>>>  	}
>>>>  
>>>> +	init_node_memory_type(numa_node, &default_pmem_type);
>>>> +
>>>
>>> The memory hot-add below may fail.  So the error handling needs to be
>>> added.
>>>
>>> And, it appears that the memory type and memory tier of a node may be
>>> fully initialized here before NUMA hot-adding started.  So I suggest to
>>> set node_memory_types[] here only.  And set memory_dev_type->nodes in
>>> node hot-add callback.  I think there is the proper place to complete
>>> the initialization.
>>>
>>> And, in theory dax/kmem.c can be unloaded.  So we need to clear
>>> node_memory_types[] for nodes somewhere.
>>>
>>
>> I guess by module exit we can be sure that all the memory managed
>> by dax/kmem is hotplugged out. How about something like below?
> 
> Because we set node_memory_types[] in dev_dax_kmem_probe(), it's
> natural to clear it in dev_dax_kmem_remove().
> 

Most of the required reset/clear is done as part of memory hotunplug. So
if we manage to successfully unplug the memory, everything except
node_memory_types[node] should already be reset. That reduces
clear_node_memory_type() to the below.

void clear_node_memory_type(int node, struct memory_dev_type *memtype)
{
	mutex_lock(&memory_tier_lock);
	/*
	 * memory unplug did clear the node from the memtype and
	 * dax/kmem did initialize this node's memory type.
	 */
	if (!node_isset(node, memtype->nodes) && node_memory_types[node] == memtype)
		node_memory_types[node] = NULL;
	mutex_unlock(&memory_tier_lock);
}

Module unload effectively force-removes the usage of the specific memtype.
Considering that module unload removes the usage of the specific memtype
from other parts of the kernel, and that we already do all the required
reset in memory hot unplug, do we even need clear_node_memory_type() above?

-aneesh





* Re: [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
  2022-08-01  4:40         ` Aneesh Kumar K V
@ 2022-08-01  5:10           ` Huang, Ying
  2022-08-01  5:38             ` Aneesh Kumar K V
  0 siblings, 1 reply; 36+ messages in thread
From: Huang, Ying @ 2022-08-01  5:10 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 8/1/22 7:36 AM, Huang, Ying wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>
>>>>> By default, all nodes are assigned to the default memory tier which
>>>>> is the memory tier designated for nodes with DRAM
>>>>>
>>>>> Set dax kmem device node's tier to slower memory tier by assigning
>>>>> abstract distance to MEMTIER_ADISTANCE_PMEM. PMEM tier
>>>>> appears below the default memory tier in demotion order.
>>>>>
>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>> ---
>>>>>  drivers/dax/kmem.c           |  9 +++++++++
>>>>>  include/linux/memory-tiers.h | 19 ++++++++++++++++++-
>>>>>  mm/memory-tiers.c            | 28 ++++++++++++++++------------
>>>>>  3 files changed, 43 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>>> index a37622060fff..6b0d5de9a3e9 100644
>>>>> --- a/drivers/dax/kmem.c
>>>>> +++ b/drivers/dax/kmem.c
>>>>> @@ -11,6 +11,7 @@
>>>>>  #include <linux/fs.h>
>>>>>  #include <linux/mm.h>
>>>>>  #include <linux/mman.h>
>>>>> +#include <linux/memory-tiers.h>
>>>>>  #include "dax-private.h"
>>>>>  #include "bus.h"
>>>>>  
>>>>> @@ -41,6 +42,12 @@ struct dax_kmem_data {
>>>>>  	struct resource *res[];
>>>>>  };
>>>>>  
>>>>> +static struct memory_dev_type default_pmem_type  = {
>>>>
>>>> Why is this named default_pmem_type?  We will not usually change the
>>>> memory type of a node.
>>>>
>>>
>>> Any other suggestion? pmem_dev_type? 
>> 
>> Or dax_pmem_type?
>> 
>> DAX is used to enumerate the memory device.
>> 
>>>
>>>>> +	.adistance = MEMTIER_ADISTANCE_PMEM,
>>>>> +	.tier_sibiling = LIST_HEAD_INIT(default_pmem_type.tier_sibiling),
>>>>> +	.nodes  = NODE_MASK_NONE,
>>>>> +};
>>>>> +
>>>>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>  {
>>>>>  	struct device *dev = &dev_dax->dev;
>>>>> @@ -62,6 +69,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>  		return -EINVAL;
>>>>>  	}
>>>>>  
>>>>> +	init_node_memory_type(numa_node, &default_pmem_type);
>>>>> +
>>>>
>>>> The memory hot-add below may fail.  So the error handling needs to be
>>>> added.
>>>>
>>>> And, it appears that the memory type and memory tier of a node may be
>>>> fully initialized here before NUMA hot-adding started.  So I suggest to
>>>> set node_memory_types[] here only.  And set memory_dev_type->nodes in
>>>> node hot-add callback.  I think there is the proper place to complete
>>>> the initialization.
>>>>
>>>> And, in theory dax/kmem.c can be unloaded.  So we need to clear
>>>> node_memory_types[] for nodes somewhere.
>>>>
>>>
>>> I guess by module exit we can be sure that all the memory managed
>>> by dax/kmem is hotplugged out. How about something like below?
>> 
>> Because we set node_memory_types[] in dev_dax_kmem_probe(), it's
>> natural to clear it in dev_dax_kmem_remove().
>> 
>
> Most of the required reset/clear is done as part of memory hotunplug. So
> if we manage to successfully unplug the memory, everything except
> node_memory_types[node] should already be reset. That reduces
> clear_node_memory_type() to the below.
>
> void clear_node_memory_type(int node, struct memory_dev_type *memtype)
> {
> 	mutex_lock(&memory_tier_lock);
> 	/*
> 	 * memory unplug did clear the node from the memtype and
> 	 * dax/kmem did initialize this node's memory type.
> 	 */
> 	if (!node_isset(node, memtype->nodes) && node_memory_types[node] == memtype)
> 		node_memory_types[node] = NULL;
> 	mutex_unlock(&memory_tier_lock);
> }
>
> Module unload effectively force-removes the usage of the specific memtype.
> Considering that module unload removes the usage of the specific memtype
> from other parts of the kernel, and that we already do all the required
> reset in memory hot unplug, do we even need clear_node_memory_type() above?

Per my understanding, we need to call clear_node_memory_type() in
dev_dax_kmem_remove().  After that, we have nothing to do in
dax_kmem_exit().

Best Regards,
Huang, Ying



* Re: [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
  2022-08-01  5:10           ` Huang, Ying
@ 2022-08-01  5:38             ` Aneesh Kumar K V
  2022-08-01  6:37               ` Huang, Ying
  0 siblings, 1 reply; 36+ messages in thread
From: Aneesh Kumar K V @ 2022-08-01  5:38 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

On 8/1/22 10:40 AM, Huang, Ying wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> 
>> On 8/1/22 7:36 AM, Huang, Ying wrote:
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>
>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>
>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>
>>>>>> By default, all nodes are assigned to the default memory tier which
>>>>>> is the memory tier designated for nodes with DRAM
>>>>>>
>>>>>> Set dax kmem device node's tier to slower memory tier by assigning
>>>>>> abstract distance to MEMTIER_ADISTANCE_PMEM. PMEM tier
>>>>>> appears below the default memory tier in demotion order.
>>>>>>
>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>> ---
>>>>>>  drivers/dax/kmem.c           |  9 +++++++++
>>>>>>  include/linux/memory-tiers.h | 19 ++++++++++++++++++-
>>>>>>  mm/memory-tiers.c            | 28 ++++++++++++++++------------
>>>>>>  3 files changed, 43 insertions(+), 13 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>>>> index a37622060fff..6b0d5de9a3e9 100644
>>>>>> --- a/drivers/dax/kmem.c
>>>>>> +++ b/drivers/dax/kmem.c
>>>>>> @@ -11,6 +11,7 @@
>>>>>>  #include <linux/fs.h>
>>>>>>  #include <linux/mm.h>
>>>>>>  #include <linux/mman.h>
>>>>>> +#include <linux/memory-tiers.h>
>>>>>>  #include "dax-private.h"
>>>>>>  #include "bus.h"
>>>>>>  
>>>>>> @@ -41,6 +42,12 @@ struct dax_kmem_data {
>>>>>>  	struct resource *res[];
>>>>>>  };
>>>>>>  
>>>>>> +static struct memory_dev_type default_pmem_type  = {
>>>>>
>>>>> Why is this named default_pmem_type?  We will not usually change the
>>>>> memory type of a node.
>>>>>
>>>>
>>>> Any other suggestion? pmem_dev_type? 
>>>
>>> Or dax_pmem_type?
>>>
>>> DAX is used to enumerate the memory device.
>>>
>>>>
>>>>>> +	.adistance = MEMTIER_ADISTANCE_PMEM,
>>>>>> +	.tier_sibiling = LIST_HEAD_INIT(default_pmem_type.tier_sibiling),
>>>>>> +	.nodes  = NODE_MASK_NONE,
>>>>>> +};
>>>>>> +
>>>>>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>>  {
>>>>>>  	struct device *dev = &dev_dax->dev;
>>>>>> @@ -62,6 +69,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>>  		return -EINVAL;
>>>>>>  	}
>>>>>>  
>>>>>> +	init_node_memory_type(numa_node, &default_pmem_type);
>>>>>> +
>>>>>
>>>>> The memory hot-add below may fail.  So the error handling needs to be
>>>>> added.
>>>>>
>>>>> And, it appears that the memory type and memory tier of a node may be
>>>>> fully initialized here before NUMA hot-adding started.  So I suggest to
>>>>> set node_memory_types[] here only.  And set memory_dev_type->nodes in
>>>>> node hot-add callback.  I think there is the proper place to complete
>>>>> the initialization.
>>>>>
>>>>> And, in theory dax/kmem.c can be unloaded.  So we need to clear
>>>>> node_memory_types[] for nodes somewhere.
>>>>>
>>>>
>>>> I guess by module exit we can be sure that all the memory managed
>>>> by dax/kmem is hotplugged out. How about something like below?
>>>
>>> Because we set node_memory_types[] in dev_dax_kmem_probe(), it's
>>> natural to clear it in dev_dax_kmem_remove().
>>>
>>
>> Most of the required reset/clear is done as part of memory hotunplug. So
>> if we manage to successfully unplug the memory, everything except
>> node_memory_types[node] should already be reset. That reduces
>> clear_node_memory_type() to the below.
>>
>> void clear_node_memory_type(int node, struct memory_dev_type *memtype)
>> {
>> 	mutex_lock(&memory_tier_lock);
>> 	/*
>> 	 * memory unplug did clear the node from the memtype and
>> 	 * dax/kmem did initialize this node's memory type.
>> 	 */
>> 	if (!node_isset(node, memtype->nodes) && node_memory_types[node] == memtype)
>> 		node_memory_types[node] = NULL;
>> 	mutex_unlock(&memory_tier_lock);
>> }
>>
>> Module unload effectively force-removes the usage of the specific memtype.
>> Considering that module unload removes the usage of the specific memtype
>> from other parts of the kernel, and that we already do all the required
>> reset in memory hot unplug, do we even need clear_node_memory_type() above?
> 
> Per my understanding, we need to call clear_node_memory_type() in
> dev_dax_kmem_remove().  After that, we have nothing to do in
> dax_kmem_exit().
> 

Ok, I guess you are suggesting we do clear_node_memory_type() even if the memory remove fails.
Should we also rebuild the demotion order? On a successful memory remove we do rebuild it.
This is what I ended up with.

modified   drivers/dax/kmem.c
@@ -171,6 +171,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 {
 	int i, success = 0;
+	int node = dev_dax->target_node;
 	struct device *dev = &dev_dax->dev;
 	struct dax_kmem_data *data = dev_get_drvdata(dev);
 
@@ -208,6 +209,12 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 		kfree(data);
 		dev_set_drvdata(dev, NULL);
 	}
+	/*
+	 * Clear the memtype association, even if the memory
+	 * remove failed.
+	 */
+	clear_node_memory_type(node, &dax_pmem_type);
+
 }
 #else
 static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
modified   include/linux/memory-tiers.h
@@ -31,6 +31,7 @@ struct memory_dev_type {
 #ifdef CONFIG_NUMA
 extern bool numa_demotion_enabled;
 void init_node_memory_type(int node, struct memory_dev_type *default_type);
+void clear_node_memory_type(int node, struct memory_dev_type *memtype);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -57,6 +58,10 @@ static inline bool node_is_toptier(int node)
 #define numa_demotion_enabled	false
 static inline void init_node_memory_type(int node, struct memory_dev_type *default_type)
 {
+}
+
+static inline void clear_node_memory_type(int node, struct memory_dev_type *memtype)
+{
 
 }
 
modified   mm/memory-tiers.c
@@ -501,6 +501,36 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type)
 }
 EXPORT_SYMBOL_GPL(init_node_memory_type);
 
+void clear_node_memory_type(int node, struct memory_dev_type *memtype)
+{
+	struct memory_tier *memtier;
+	pg_data_t *pgdat = NODE_DATA(node);
+
+	mutex_lock(&memory_tier_lock);
+	/*
+	 * Even if we fail to unplug memory, clear the association of
+	 * this node to this specific memory type.
+	 */
+	if (node_memory_types[node] == memtype) {
+		memtier = __node_get_memory_tier(node);
+		if (memtier) {
+			rcu_assign_pointer(pgdat->memtier, NULL);
+			synchronize_rcu();
+		}
+		node_clear(node, memtype->nodes);
+		if (nodes_empty(memtype->nodes)) {
+			list_del(&memtype->tier_sibiling);
+			memtype->memtier = NULL;
+			if (memtier && list_empty(&memtier->memory_types))
+				destroy_memory_tier(memtier);
+		}
+		node_memory_types[node] = NULL;
+	}
+	mutex_unlock(&memory_tier_lock);
+}
+EXPORT_SYMBOL_GPL(clear_node_memory_type);
+
 void update_node_adistance(int node, struct memory_dev_type *memtype)
 {
 	pg_data_t *pgdat;





* Re: [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
  2022-08-01  5:38             ` Aneesh Kumar K V
@ 2022-08-01  6:37               ` Huang, Ying
  2022-08-01  6:55                 ` Aneesh Kumar K V
  0 siblings, 1 reply; 36+ messages in thread
From: Huang, Ying @ 2022-08-01  6:37 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 8/1/22 10:40 AM, Huang, Ying wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> On 8/1/22 7:36 AM, Huang, Ying wrote:
>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>
>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>
>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>>
>>>>>>> By default, all nodes are assigned to the default memory tier which
>>>>>>> is the memory tier designated for nodes with DRAM
>>>>>>>
>>>>>>> Set dax kmem device node's tier to slower memory tier by assigning
>>>>>>> abstract distance to MEMTIER_ADISTANCE_PMEM. PMEM tier
>>>>>>> appears below the default memory tier in demotion order.
>>>>>>>
>>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>>> ---
>>>>>>>  drivers/dax/kmem.c           |  9 +++++++++
>>>>>>>  include/linux/memory-tiers.h | 19 ++++++++++++++++++-
>>>>>>>  mm/memory-tiers.c            | 28 ++++++++++++++++------------
>>>>>>>  3 files changed, 43 insertions(+), 13 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>>>>> index a37622060fff..6b0d5de9a3e9 100644
>>>>>>> --- a/drivers/dax/kmem.c
>>>>>>> +++ b/drivers/dax/kmem.c
>>>>>>> @@ -11,6 +11,7 @@
>>>>>>>  #include <linux/fs.h>
>>>>>>>  #include <linux/mm.h>
>>>>>>>  #include <linux/mman.h>
>>>>>>> +#include <linux/memory-tiers.h>
>>>>>>>  #include "dax-private.h"
>>>>>>>  #include "bus.h"
>>>>>>>  
>>>>>>> @@ -41,6 +42,12 @@ struct dax_kmem_data {
>>>>>>>  	struct resource *res[];
>>>>>>>  };
>>>>>>>  
>>>>>>> +static struct memory_dev_type default_pmem_type  = {
>>>>>>
>>>>>> Why is this named default_pmem_type?  We will not usually change the
>>>>>> memory type of a node.
>>>>>>
>>>>>
>>>>> Any other suggestion? pmem_dev_type? 
>>>>
>>>> Or dax_pmem_type?
>>>>
>>>> DAX is used to enumerate the memory device.
>>>>
>>>>>
>>>>>>> +	.adistance = MEMTIER_ADISTANCE_PMEM,
>>>>>>> +	.tier_sibiling = LIST_HEAD_INIT(default_pmem_type.tier_sibiling),
>>>>>>> +	.nodes  = NODE_MASK_NONE,
>>>>>>> +};
>>>>>>> +
>>>>>>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>>>  {
>>>>>>>  	struct device *dev = &dev_dax->dev;
>>>>>>> @@ -62,6 +69,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>>>  		return -EINVAL;
>>>>>>>  	}
>>>>>>>  
>>>>>>> +	init_node_memory_type(numa_node, &default_pmem_type);
>>>>>>> +
>>>>>>
>>>>>> The memory hot-add below may fail.  So the error handling needs to be
>>>>>> added.
>>>>>>
>>>>>> And, it appears that the memory type and memory tier of a node may be
>>>>>> fully initialized here before NUMA hot-adding started.  So I suggest to
>>>>>> set node_memory_types[] here only.  And set memory_dev_type->nodes in
>>>>>> node hot-add callback.  I think there is the proper place to complete
>>>>>> the initialization.
>>>>>>
>>>>>> And, in theory dax/kmem.c can be unloaded.  So we need to clear
>>>>>> node_memory_types[] for nodes somewhere.
>>>>>>
>>>>>
>>>>> I guess by module exit we can be sure that all the memory managed
>>>>> by dax/kmem is hotplugged out. How about something like below?
>>>>
>>>> Because we set node_memory_types[] in dev_dax_kmem_probe(), it's
>>>> natural to clear it in dev_dax_kmem_remove().
>>>>
>>>
>>> Most of the required reset/clear is done as part of memory hotunplug. So
>>> if we manage to successfully unplug the memory, everything except
>>> node_memory_types[node] should already be reset. That reduces
>>> clear_node_memory_type() to the below.
>>>
>>> void clear_node_memory_type(int node, struct memory_dev_type *memtype)
>>> {
>>> 	mutex_lock(&memory_tier_lock);
>>> 	/*
>>> 	 * memory unplug did clear the node from the memtype and
>>> 	 * dax/kmem did initialize this node's memory type.
>>> 	 */
>>> 	if (!node_isset(node, memtype->nodes) && node_memory_types[node] == memtype)
>>> 		node_memory_types[node] = NULL;
>>> 	mutex_unlock(&memory_tier_lock);
>>> }
>>>
>>> Module unload effectively force-removes the usage of the specific memtype.
>>> Considering that module unload removes the usage of the specific memtype
>>> from other parts of the kernel, and that we already do all the required
>>> reset in memory hot unplug, do we even need clear_node_memory_type() above?
>> 
>> Per my understanding, we need to call clear_node_memory_type() in
>> dev_dax_kmem_remove().  After that, we have nothing to do in
>> dax_kmem_exit().
>> 
>
> Ok, I guess you are suggesting we do clear_node_memory_type() even if the memory remove fails.

Can we use node_memory_types[] to indicate whether a node is managed by
a driver?

Regardless of whether it succeeds or fails, dev_dax_kmem_remove() will set
node_memory_types[] = NULL.  But until the node is offlined, we will still
keep the node in the memory_dev_type (dax_pmem_type).

And we will prevent dax/kmem from unloading via try_module_get() and add
"struct module *" to struct memory_dev_type.

Best Regards,
Huang, Ying

> Should we also rebuild the demotion order? On a successful memory remove we do rebuild it.
> This is what I ended up with.
>
> modified   drivers/dax/kmem.c
> @@ -171,6 +171,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
>  {
>  	int i, success = 0;
> +	int node = dev_dax->target_node;
>  	struct device *dev = &dev_dax->dev;
>  	struct dax_kmem_data *data = dev_get_drvdata(dev);
>  
> @@ -208,6 +209,12 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
>  		kfree(data);
>  		dev_set_drvdata(dev, NULL);
>  	}
> +	/*
> +	 * Clear the memtype association, even if the memory
> +	 * remove failed.
> +	 */
> +	clear_node_memory_type(node, &dax_pmem_type);
> +
>  }
>  #else
>  static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
> modified   include/linux/memory-tiers.h
> @@ -31,6 +31,7 @@ struct memory_dev_type {
>  #ifdef CONFIG_NUMA
>  extern bool numa_demotion_enabled;
>  void init_node_memory_type(int node, struct memory_dev_type *default_type);
> +void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> @@ -57,6 +58,10 @@ static inline bool node_is_toptier(int node)
>  #define numa_demotion_enabled	false
>  static inline void init_node_memory_type(int node, struct memory_dev_type *default_type)
>  {
> +}
> +
> +static inline void clear_node_memory_type(int node, struct memory_dev_type *memtype)
> +{
>  
>  }
>  
> modified   mm/memory-tiers.c
> @@ -501,6 +501,36 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type)
>  }
>  EXPORT_SYMBOL_GPL(init_node_memory_type);
>  
> +void clear_node_memory_type(int node, struct memory_dev_type *memtype)
> +{
> +	struct memory_tier *memtier;
> +	pg_data_t *pgdat = NODE_DATA(node);
> +
> +	mutex_lock(&memory_tier_lock);
> +	/*
> +	 * Even if we fail to unplug memory, clear the association of
> +	 * this node to this specific memory type.
> +	 */
> +	if (node_memory_types[node] == memtype) {
> +		memtier = __node_get_memory_tier(node);
> +		if (memtier) {
> +			rcu_assign_pointer(pgdat->memtier, NULL);
> +			synchronize_rcu();
> +		}
> +		node_clear(node, memtype->nodes);
> +		if (nodes_empty(memtype->nodes)) {
> +			list_del(&memtype->tier_sibiling);
> +			memtype->memtier = NULL;
> +			if (memtier && list_empty(&memtier->memory_types))
> +				destroy_memory_tier(memtier);
> +		}
> +		node_memory_types[node] = NULL;
> +	}
> +	mutex_unlock(&memory_tier_lock);
> +}
> +EXPORT_SYMBOL_GPL(clear_node_memory_type);
> +
>  void update_node_adistance(int node, struct memory_dev_type *memtype)
>  {
>  	pg_data_t *pgdat;
>


* Re: [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
  2022-08-01  6:37               ` Huang, Ying
@ 2022-08-01  6:55                 ` Aneesh Kumar K V
  2022-08-01  7:13                   ` Huang, Ying
  0 siblings, 1 reply; 36+ messages in thread
From: Aneesh Kumar K V @ 2022-08-01  6:55 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

On 8/1/22 12:07 PM, Huang, Ying wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> 
>> On 8/1/22 10:40 AM, Huang, Ying wrote:
>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>
>>>> On 8/1/22 7:36 AM, Huang, Ying wrote:
>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>
>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>
>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

....

>>>>
>>>> Module unload effectively force-removes the usage of the specific memtype.
>>>> Considering that module unload removes the usage of the specific memtype
>>>> from other parts of the kernel, and that we already do all the required
>>>> reset in memory hot unplug, do we even need clear_node_memory_type() above?
>>>
>>> Per my understanding, we need to call clear_node_memory_type() in
>>> dev_dax_kmem_remove().  After that, we have nothing to do in
>>> dax_kmem_exit().
>>>
>>
>> Ok, I guess you are suggesting we do clear_node_memory_type() even if the memory remove fails.
> 
> Can we use node_memory_types[] to indicate whether a node is managed by
> a driver?
> 
> Regardless of whether it succeeds or fails, dev_dax_kmem_remove() will set
> node_memory_types[] = NULL.  But until the node is offlined, we will still
> keep the node in the memory_dev_type (dax_pmem_type).
> 
> And we will prevent dax/kmem from unloading via try_module_get() and add
> "struct module *" to struct memory_dev_type.
> 

The current dax/kmem driver does not hold any module reference and allows the module to be
unloaded at any time, even if memory onlined by the driver fails to be unplugged. The addition
of memory_dev_type as you suggest would be different from that. Page demotion can continue to
work without the support of dax_pmem_type as long as we keep the older demotion order. Any new
demotion order rebuild will remove the memory node that was not hotunplugged from the demotion
order. Isn't that a much simpler implementation?

-aneesh



* Re: [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
  2022-08-01  6:55                 ` Aneesh Kumar K V
@ 2022-08-01  7:13                   ` Huang, Ying
  2022-08-01  7:41                     ` Aneesh Kumar K V
  0 siblings, 1 reply; 36+ messages in thread
From: Huang, Ying @ 2022-08-01  7:13 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 8/1/22 12:07 PM, Huang, Ying wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> On 8/1/22 10:40 AM, Huang, Ying wrote:
>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>>
>>>>> On 8/1/22 7:36 AM, Huang, Ying wrote:
>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>>
>>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>>
>>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
> ....
>
>>>>>
>>>>> Module unload effectively force-removes the usage of the specific memtype.
>>>>> Considering that module unload removes the usage of the specific memtype
>>>>> from other parts of the kernel, and that we already do all the required
>>>>> reset in memory hot unplug, do we even need clear_node_memory_type() above?
>>>>
>>>> Per my understanding, we need to call clear_node_memory_type() in
>>>> dev_dax_kmem_remove().  After that, we have nothing to do in
>>>> dax_kmem_exit().
>>>>
>>>
>>> Ok, I guess you are suggesting we do clear_node_memory_type() even if the memory remove fails.
>> 
>> Can we use node_memory_types[] to indicate whether a node is managed by
>> a driver?
>> 
>> Regardless of whether it succeeds or fails, dev_dax_kmem_remove() will set
>> node_memory_types[] = NULL.  But until the node is offlined, we will still
>> keep the node in the memory_dev_type (dax_pmem_type).
>> 
>> And we will prevent dax/kmem from unloading via try_module_get() and add
>> "struct module *" to struct memory_dev_type.
>> 
>
> The current dax/kmem driver does not hold any module reference and allows the module to be
> unloaded at any time, even if memory onlined by the driver fails to be unplugged. The addition
> of memory_dev_type as you suggest would be different from that. Page demotion can continue to
> work without the support of dax_pmem_type as long as we keep the older demotion order. Any new
> demotion order rebuild will remove the memory node that was not hotunplugged from the demotion
> order. Isn't that a much simpler implementation?

Per my understanding, unbinding/binding the dax/kmem driver means
changing the memory type of a memory device.  For example, unbinding
the dax/kmem driver may mean changing the memory type from dax_pmem_type to
default_memory_type (or default_dram_type).  That appears strange.  But
if we force the NUMA node to be offlined for unbinding, we can avoid
changing the memory type to default_memory_type.

Best Regards,
Huang, Ying


* Re: [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
  2022-08-01  7:13                   ` Huang, Ying
@ 2022-08-01  7:41                     ` Aneesh Kumar K V
  2022-08-02  1:58                       ` Huang, Ying
  0 siblings, 1 reply; 36+ messages in thread
From: Aneesh Kumar K V @ 2022-08-01  7:41 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

On 8/1/22 12:43 PM, Huang, Ying wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> 
>> On 8/1/22 12:07 PM, Huang, Ying wrote:
>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>
>>>> On 8/1/22 10:40 AM, Huang, Ying wrote:
>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>>>
>>>>>> On 8/1/22 7:36 AM, Huang, Ying wrote:
>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>>>
>>>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>>>
>>>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>
>> ....
>>
>>>>>>
>>>>>> Module unload effectively force-removes the usage of the specific memtype.
>>>>>> Considering that module unload removes the usage of the specific memtype
>>>>>> from other parts of the kernel, and that we already do all the required
>>>>>> reset in memory hot unplug, do we even need clear_node_memory_type() above?
>>>>>
>>>>> Per my understanding, we need to call clear_node_memory_type() in
>>>>> dev_dax_kmem_remove().  After that, we have nothing to do in
>>>>> dax_kmem_exit().
>>>>>
>>>>
>>>> Ok, I guess you are suggesting we do clear_node_memory_type() even if the memory remove fails.
>>>
>>> Can we use node_memory_types[] to indicate whether a node is managed by
>>> a driver?
>>>
>>> Regardless of whether it succeeds or fails, dev_dax_kmem_remove() will set
>>> node_memory_types[] = NULL.  But until the node is offlined, we will still
>>> keep the node in the memory_dev_type (dax_pmem_type).
>>>
>>> And we will prevent dax/kmem from unloading via try_module_get() and add
>>> "struct module *" to struct memory_dev_type.
>>>
>>
>> The current dax/kmem driver does not hold any module reference and allows the module to be
>> unloaded at any time, even if memory onlined by the driver fails to be unplugged. The addition
>> of memory_dev_type as you suggest would be different from that. Page demotion can continue to
>> work without the support of dax_pmem_type as long as we keep the older demotion order. Any new
>> demotion order rebuild will remove the memory node that was not hotunplugged from the demotion
>> order. Isn't that a much simpler implementation?
> 
> Per my understanding, unbinding/binding the dax/kmem driver means
> changing the memory type of a memory device.  For example, unbinding
> the dax/kmem driver may mean changing the memory type from dax_pmem_type to
> default_memory_type (or default_dram_type).  That appears strange.  But
> if we force the NUMA node to be offlined for unbinding, we can avoid
> changing the memory type to default_memory_type.
> 

If we are able to unplug all the memory, we do remove the node from N_MEMORY.
If we fail to unplug the memory, we have two options. 

1) Keep the same demotion order
2) Rebuild the demotion order, which results in the memory NUMA node not
   participating in demotion.

I agree with you that we should not switch to the default memory type.

The below code demonstrates how it can be done. If we want to keep
the same demotion order, we can remove establish_demotion_targets() from
the below code.

void clear_node_memory_type(int node, struct memory_dev_type *memtype)
{
	struct memory_tier *memtier;
	pg_data_t *pgdat = NODE_DATA(node);

	mutex_lock(&memory_tier_lock);
	/*
	 * Even if we fail to unplug memory, clear the association of
	 * this node to this specific memory type, but only if the node
	 * actually belongs to it.
	 */
	if (node_memory_types[node] == memtype) {
		if (node_isset(node, memtype->nodes)) {
			memtier = __node_get_memory_tier(node);
			if (memtier) {
				rcu_assign_pointer(pgdat->memtier, NULL);
				synchronize_rcu();
			}
			node_clear(node, memtype->nodes);
			if (nodes_empty(memtype->nodes)) {
				list_del(&memtype->tier_sibiling);
				memtype->memtier = NULL;
				if (memtier && list_empty(&memtier->memory_types))
					destroy_memory_tier(memtier);
			}
			establish_demotion_targets();
		}
		node_memory_types[node] = NULL;
	}
	mutex_unlock(&memory_tier_lock);
}


If we agree that we want to keep the current behavior (that is, to allow kmem driver unload
even on memory unplug failure), we can go with the above change. If we are suggesting we
should prevent driver unload, then IMHO it should be independent of memory_dev_type
(or this patch series). We should make sure we take a module reference on successful
memory online and drop it only on successful offline.
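
For reference, that alternative could look like the below (untested
sketch; the add_memory_driver_managed()/remove_memory() calls follow
their current signatures, the pinning itself is illustrative):

#include <linux/memory_hotplug.h>
#include <linux/module.h>

/* Pin this module from successful online until successful offline. */
static int kmem_online_and_pin(int nid, u64 start, u64 size)
{
	int rc = add_memory_driver_managed(nid, start, size,
					   "System RAM (kmem)", MHP_NONE);
	if (!rc)
		__module_get(THIS_MODULE);	/* memory went online */
	return rc;
}

static int kmem_offline_and_unpin(u64 start, u64 size)
{
	int rc = remove_memory(start, size);
	if (!rc)
		module_put(THIS_MODULE);	/* fully offline again */
	return rc;
}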


-aneesh




* Re: [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
  2022-08-01  7:41                     ` Aneesh Kumar K V
@ 2022-08-02  1:58                       ` Huang, Ying
  0 siblings, 0 replies; 36+ messages in thread
From: Huang, Ying @ 2022-08-02  1:58 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 8/1/22 12:43 PM, Huang, Ying wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> On 8/1/22 12:07 PM, Huang, Ying wrote:
>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>>
>>>>> On 8/1/22 10:40 AM, Huang, Ying wrote:
>>>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>>>>
>>>>>>> On 8/1/22 7:36 AM, Huang, Ying wrote:
>>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>>>>
>>>>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>>>>
>>>>>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>
>>> ....
>>>
>>>>>>>
>>>>>>> Module unload effectively force-removes the usage of the specific memtype.
>>>>>>> Considering that module unload removes the usage of the specific memtype
>>>>>>> from other parts of the kernel, and that we already do all the required
>>>>>>> reset in memory hot unplug, do we even need clear_node_memory_type() above?
>>>>>>
>>>>>> Per my understanding, we need to call clear_node_memory_type() in
>>>>>> dev_dax_kmem_remove().  After that, we have nothing to do in
>>>>>> dax_kmem_exit().
>>>>>>
>>>>>
>>>>> Ok, I guess you are suggesting we do clear_node_memory_type() even if the memory remove fails.
>>>>
>>>> Can we use node_memory_types[] to indicate whether a node is managed by
>>>> a driver?
>>>>
>>>> Regardless of whether it succeeds or fails, dev_dax_kmem_remove() will set
>>>> node_memory_types[] = NULL.  But until the node is offlined, we will still
>>>> keep the node in the memory_dev_type (dax_pmem_type).
>>>>
>>>> And we will prevent dax/kmem from unloading via try_module_get() and add
>>>> "struct module *" to struct memory_dev_type.
>>>>
>>>
>>> The current dax/kmem driver does not hold any module reference and allows the module to be
>>> unloaded at any time, even if memory onlined by the driver fails to be unplugged. The addition
>>> of memory_dev_type as you suggest would be different from that. Page demotion can continue to
>>> work without the support of dax_pmem_type as long as we keep the older demotion order. Any new
>>> demotion order rebuild will remove the memory node that was not hotunplugged from the demotion
>>> order. Isn't that a much simpler implementation?
>> 
>> Per my understanding, unbinding/binding the dax/kmem driver means
>> changing the memory type of a memory device.  For example, unbinding
>> the dax/kmem driver may mean changing the memory type from dax_pmem_type to
>> default_memory_type (or default_dram_type).  That appears strange.  But
>> if we force the NUMA node to be offlined for unbinding, we can avoid
>> changing the memory type to default_memory_type.
>> 
>
> If we are able to unplug all the memory, we do remove the node from N_MEMORY.
> If we fail to unplug the memory, we have two options. 
>
> 1) Keep the same demotion order
> 2) Rebuild the demotion order, which results in the memory NUMA node not
>    participating in demotion.
>
> I agree with you that we should not switch to the default memory type.
>
> The below code demonstrates how it can be done. If we want to keep
> the same demotion order, we can remove establish_demotion_targets() from
> the below code.
>
> void clear_node_memory_type(int node, struct memory_dev_type *memtype)
> {
> 	struct memory_tier *memtier;
> 	pg_data_t *pgdat = NODE_DATA(node);
>
> 	mutex_lock(&memory_tier_lock);
> 	/*
> 	 * Even if we fail to unplug memory, clear the association of
> 	 * this node to this specific memory type, but only if the node
> 	 * actually belongs to it.
> 	 */
> 	if (node_memory_types[node] == memtype) {
> 		if (node_isset(node, memtype->nodes)) {
> 			memtier = __node_get_memory_tier(node);
> 			if (memtier) {
> 				rcu_assign_pointer(pgdat->memtier, NULL);
> 				synchronize_rcu();
> 			}
> 			node_clear(node, memtype->nodes);
> 			if (nodes_empty(memtype->nodes)) {
> 				list_del(&memtype->tier_sibiling);
> 				memtype->memtier = NULL;
> 				if (memtier && list_empty(&memtier->memory_types))
> 					destroy_memory_tier(memtier);
> 			}
> 			establish_demotion_targets();
> 		}
> 		node_memory_types[node] = NULL;
> 	}
> 	mutex_unlock(&memory_tier_lock);
> }
>
>
> If we agree that we want to keep the current behavior (that is, to allow kmem driver unload
> even on memory unplug failure), we can go with the above change. If we are suggesting we
> should prevent driver unload, then IMHO it should be independent of memory_dev_type
> (or this patch series). We should make sure we take a module reference on successful
> memory online and drop it only on successful offline.

I suggest keeping a NUMA node in the memory_dev_type (dax_pmem_type)
until the node is offlined.

Yes.  The dax/kmem driver may be unbound from the dax device even if
memory offlining fails.  But we can still find some way to keep
the memory_dev_type of the NUMA node unchanged.

Solution 1 is to prevent the dax/kmem driver from unloading via a module
reference.  I think we do that in this series.

Solution 2 is to allocate dax_pmem_type dynamically, and manage it like
"kmem_name".  We may need some kind of reference counting to manage it.

Best Regards,
Huang, Ying



* Re: [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers
       [not found]       ` <62e89c9addcc_62c2a29443@dwillia2-xfh.jf.intel.com.notmuch>
@ 2022-08-02  5:03         ` Aneesh Kumar K V
       [not found]         ` <87zggni13h.fsf@yhuang6-desk2.ccr.corp.intel.com>
  1 sibling, 0 replies; 36+ messages in thread
From: Aneesh Kumar K V @ 2022-08-02  5:03 UTC (permalink / raw)
  To: Dan Williams, Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Johannes Weiner,
	jvgediya.oss, Jagdish Gediya

On 8/2/22 9:10 AM, Dan Williams wrote:
> Huang, Ying wrote:
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>> Aneesh Kumar K.V wrote:
>>>> In the current kernel, memory tiers are defined implicitly via a demotion path
>>>> relationship between NUMA nodes, which is created during the kernel
>>>> initialization and updated when a NUMA node is hot-added or hot-removed. The
>>>> current implementation puts all nodes with CPU into the highest tier, and builds
>>>> the tier hierarchy tier-by-tier by establishing the per-node demotion targets
>>>> based on the distances between nodes.
>>>>
>>>> This current memory tier kernel implementation needs to be improved for several
>>>> important use cases,
>>>>
>>>> The current tier initialization code always initializes each memory-only NUMA
>>>> node into a lower tier. But a memory-only NUMA node may have a high performance
>>>> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
>>>> should be put into a higher tier.
>>>>
>>>> The current tier hierarchy always puts CPU nodes into the top tier. But on a
>>>> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices
>>>> should be in the top tier, and DRAM nodes with CPUs are better to be placed into
>>>> the next lower tier.
>>>>
>>>> With the current kernel, a higher tier node can only be demoted to nodes
>>>> with the shortest distance on the next lower tier as defined by the
>>>> demotion path, not any other node from any lower tier. This strict
>>>> demotion order does not work in all use cases (e.g. some use cases may
>>>> want to allow cross-socket demotion to another node in the same demotion
>>>> tier as a fallback when the preferred demotion node is out of space).
>>>> This demotion order is also inconsistent with the page allocation
>>>> fallback order when all the nodes in a higher tier are out of space: The
>>>> page allocation can fall back to any node from any lower tier, whereas
>>>> the demotion order doesn't allow that.
>>>>
>>>> This patch series addresses the above by defining memory tiers explicitly.
>>>>
>>>> Linux kernel presents memory devices as NUMA nodes and each memory device is of
>>>> a specific type. The memory type of a device is represented by its abstract
>>>> distance. A memory tier corresponds to a range of abstract distance. This allows
>>>> for classifying memory devices with a specific performance range into a memory
>>>> tier.
>>>>
>>>> This patch configures the range/chunk size to be 128. The default DRAM
>>>> abstract distance is 512. We can have 4 memory tiers below the default DRAM
>>>> abstract distance, which cover the ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
>>>> Slower memory devices like persistent memory will have abstract distance below
>>>> the default DRAM level and hence will be placed in these 4 lower tiers.
>>>>
>>>> A kernel parameter is provided to override the default memory tier.
>>>>
>>>> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>>>> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>>
>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>> ---
>>>>  include/linux/memory-tiers.h |  17 ++++++
>>>>  mm/Makefile                  |   1 +
>>>>  mm/memory-tiers.c            | 102 +++++++++++++++++++++++++++++++++++
>>>>  3 files changed, 120 insertions(+)
>>>>  create mode 100644 include/linux/memory-tiers.h
>>>>  create mode 100644 mm/memory-tiers.c
>>>>
>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>>> new file mode 100644
>>>> index 000000000000..8d7884b7a3f0
>>>> --- /dev/null
>>>> +++ b/include/linux/memory-tiers.h
>>>> @@ -0,0 +1,17 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +#ifndef _LINUX_MEMORY_TIERS_H
>>>> +#define _LINUX_MEMORY_TIERS_H
>>>> +
>>>> +/*
>>>>> + * Each tier covers an abstract distance chunk size of 128
>>>> + */
>>>> +#define MEMTIER_CHUNK_BITS	7
>>>> +#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
>>>> +/*
>>>>> + * For now let's have 4 memory tiers below the default DRAM tier.
>>>> + */
>>>> +#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
>>>> +/* leave one tier below this slow pmem */
>>>> +#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
>>>
>>> Why is memory type encoded in these values? There is no reason to
>>> believe that PMEM is of a lower performance tier than DRAM. Consider
>>> high performance energy backed DRAM that makes it "PMEM", consider CXL
>>> attached DRAM over a switch topology and constrained links that makes it
>>> a lower performance tier than locally attached DRAM. The names should be
>>> associated with tiers that indicate their usage. Something like HOT,
>>> GENERAL, and COLD. Where, for example, HOT is low capacity high
>>> performance compared to the general purpose pool, and COLD is high
>>> capacity low performance intended to offload the general purpose tier.
>>>
>>> It does not need to be exactly that ontology, but please try to not
>>> encode policy meaning behind memory types. There has been explicit
>>> effort to avoid that to date because types are fraught for declaring
>>> relative performance characteristics, and the relative performance
>>> changes based on what memory types are assembled in a given system.
>>
>> Yes.  MEMTIER_ADISTANCE_PMEM is somewhat oversimplified.  It is only
>> used in this very first version to keep things as simple as possible.
> 
> I am failing to see the simplicity of using names that convey a
> performance contract that are invalid depending on the system.
> 
>> I think we can come up with something better in the later version.
>> For example, identify the abstract distance of a PMEM device based on
>> HMAT, etc. 
> 
> Memory tiering has nothing to do with persistence why is PMEM in the
> name at all?
> 

How about

MEMTIER_DEFAULT_DAX_ADISTANCE, with a comment there explaining that if low level
drivers don't initialize a memory_dev_type for a device/NUMA node, dax/kmem will
consider the node slower than DRAM?
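
i.e. something like (one possible spelling, keeping the v11 PMEM value;
illustrative only):

/*
 * If the low level driver doesn't initialize a memory_dev_type for a
 * device/NUMA node, dax/kmem assumes the node is slower than DRAM.
 */
#define MEMTIER_DEFAULT_DAX_ADISTANCE	(1 << MEMTIER_CHUNK_BITS)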


>>  And even in this first version, we should put MEMTIER_ADISTANCE_PMEM
>>  in dax/kmem.c.  Because it's just for that specific type of memory
>>  used now, not for all PMEM.
> 
> dax/kmem.c also handles HBM and "soft reserved" memory in general. There
> is also nothing PMEM specific about the device-dax subsystem.
> 
>> In the current design, memory type is used to report the performance of
>> the hardware, in terms of abstract distance, per Johannes' suggestion.
> 
> That sounds fine, just pick an abstract name, not an explicit memory
> type.
> 
>> Which is an abstraction of memory latency and bandwidth.  Policy is
>> described via memory tiers.  Several memory types may be put in one
>> memory tier.  The abstract distance chunk size of the memory tier may
>> be adjusted according to policy.
> 
> That part all sounds good. That said, I do not see the benefit of
> waiting to run away from these inadequate names.

-aneesh


* Re: [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers
       [not found]         ` <87zggni13h.fsf@yhuang6-desk2.ccr.corp.intel.com>
@ 2022-08-02  9:34           ` Aneesh Kumar K V
  2022-08-04  0:56             ` Huang, Ying
  0 siblings, 1 reply; 36+ messages in thread
From: Aneesh Kumar K V @ 2022-08-02  9:34 UTC (permalink / raw)
  To: Huang, Ying, Dan Williams
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Johannes Weiner,
	jvgediya.oss, Jagdish Gediya

On 8/2/22 12:27 PM, Huang, Ying wrote:
> Dan Williams <dan.j.williams@intel.com> writes:
> 
>> Huang, Ying wrote:
>>> Dan Williams <dan.j.williams@intel.com> writes:
>>>
>>>> Aneesh Kumar K.V wrote:
>>>>> In the current kernel, memory tiers are defined implicitly via a demotion path
>>>>> relationship between NUMA nodes, which is created during the kernel
>>>>> initialization and updated when a NUMA node is hot-added or hot-removed. The
>>>>> current implementation puts all nodes with CPU into the highest tier, and builds
>>>>> the tier hierarchy tier-by-tier by establishing the per-node demotion targets
>>>>> based on the distances between nodes.
>>>>>
>>>>> This current memory tier kernel implementation needs to be improved for several
>>>>> important use cases,
>>>>>
>>>>> The current tier initialization code always initializes each memory-only NUMA
>>>>> node into a lower tier. But a memory-only NUMA node may have a high performance
>>>>> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
>>>>> should be put into a higher tier.
>>>>>
>>>>> The current tier hierarchy always puts CPU nodes into the top tier. But on a
>>>>> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices
>>>>> should be in the top tier, and DRAM nodes with CPUs are better to be placed into
>>>>> the next lower tier.
>>>>>
>>>>> With the current kernel, a higher tier node can only be demoted to nodes
>>>>> with the shortest distance on the next lower tier as defined by the
>>>>> demotion path, not any other node from any lower tier. This strict
>>>>> demotion order does not work in all use cases (e.g. some use cases may
>>>>> want to allow cross-socket demotion to another node in the same demotion
>>>>> tier as a fallback when the preferred demotion node is out of space).
>>>>> This demotion order is also inconsistent with the page allocation
>>>>> fallback order when all the nodes in a higher tier are out of space: The
>>>>> page allocation can fall back to any node from any lower tier, whereas
>>>>> the demotion order doesn't allow that.
>>>>>
>>>>> This patch series addresses the above by defining memory tiers explicitly.
>>>>>
>>>>> Linux kernel presents memory devices as NUMA nodes and each memory device is of
>>>>> a specific type. The memory type of a device is represented by its abstract
>>>>> distance. A memory tier corresponds to a range of abstract distance. This allows
>>>>> for classifying memory devices with a specific performance range into a memory
>>>>> tier.
>>>>>
>>>>> This patch configures the range/chunk size to be 128. The default DRAM
>>>>> abstract distance is 512. We can have 4 memory tiers below the default DRAM
>>>>> abstract distance, which cover the ranges 0 - 127, 128 - 255, 256 - 383, 384 - 511.
>>>>> Slower memory devices like persistent memory will have abstract distance below
>>>>> the default DRAM level and hence will be placed in these 4 lower tiers.
>>>>>
>>>>> A kernel parameter is provided to override the default memory tier.
>>>>>
>>>>> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>>>>> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>>>
>>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>> ---
>>>>>  include/linux/memory-tiers.h |  17 ++++++
>>>>>  mm/Makefile                  |   1 +
>>>>>  mm/memory-tiers.c            | 102 +++++++++++++++++++++++++++++++++++
>>>>>  3 files changed, 120 insertions(+)
>>>>>  create mode 100644 include/linux/memory-tiers.h
>>>>>  create mode 100644 mm/memory-tiers.c
>>>>>
>>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>>>> new file mode 100644
>>>>> index 000000000000..8d7884b7a3f0
>>>>> --- /dev/null
>>>>> +++ b/include/linux/memory-tiers.h
>>>>> @@ -0,0 +1,17 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>> +#ifndef _LINUX_MEMORY_TIERS_H
>>>>> +#define _LINUX_MEMORY_TIERS_H
>>>>> +
>>>>> +/*
>>>>> + * Each tier covers an abstract distance chunk of size 128
>>>>> + */
>>>>> +#define MEMTIER_CHUNK_BITS	7
>>>>> +#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
>>>>> +/*
>>>>> + * For now let's have 4 memory tiers below the default DRAM tier.
>>>>> + */
>>>>> +#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
>>>>> +/* leave one tier below this slow pmem level */
>>>>> +#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
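
A quick, self-contained illustration of the chunking above (a userspace
sketch; the masking rule is an assumption about how the 128-wide ranges
are carved up, not code taken from the patch):

#include <stdio.h>

#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)	/* 128 */
#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))	/* 512 */
#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)	/* 128 */

int main(void)
{
	int adist[] = { MEMTIER_ADISTANCE_PMEM, MEMTIER_ADISTANCE_DRAM };

	for (int i = 0; i < 2; i++) {
		/* round down to the start of the 128-wide chunk */
		int start = adist[i] & ~(MEMTIER_CHUNK_SIZE - 1);

		/* prints [128, 255] for PMEM and [512, 639] for DRAM */
		printf("adistance %d -> tier range [%d, %d]\n",
		       adist[i], start, start + MEMTIER_CHUNK_SIZE - 1);
	}
	return 0;
}
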
>>>>
>>>> Why is memory type encoded in these values? There is no reason to
>>>> believe that PMEM is of a lower performance tier than DRAM. Consider
>>>> high performance energy-backed DRAM that makes it "PMEM"; consider CXL
>>>> attached DRAM over a switch topology and constrained links that make it
>>>> a lower performance tier than locally attached DRAM. The names should be
>>>> associated with tiers that indicate their usage: something like HOT,
>>>> GENERAL, and COLD, where, for example, HOT is low capacity and high
>>>> performance compared to the general purpose pool, and COLD is high
>>>> capacity and low performance, intended to offload the general purpose tier.
>>>>
>>>> It does not need to be exactly that ontology, but please try not to
>>>> encode policy meaning behind memory types. There has been explicit
>>>> effort to avoid that to date, because types are a fraught basis for declaring
>>>> relative performance characteristics, and the relative performance
>>>> changes based on what memory types are assembled in a given system.
>>>
>>> Yes.  MEMTIER_ADISTANCE_PMEM is somewhat oversimplified.  It is only
>>> used in this very first version to keep things as simple as possible.
>>
>> I am failing to see the simplicity of using names that convey a
>> performance contract which is invalid depending on the system.
>>
>>> I think we can come up with something better in a later version.
>>> For example, identifying the abstract distance of a PMEM device based on
>>> HMAT, etc. 
>>
>> Memory tiering has nothing to do with persistence, so why is PMEM in the
>> name at all?
>>
>>>  And even in this first version, we should put MEMTIER_ADISTANCE_PMEM
>>>  in dax/kmem.c, because it's just for that specific type of memory
>>>  used now, not for all PMEM.
>>
>> dax/kmem.c also handles HBM and "soft reserved" memory in general. There
>> is also nothing PMEM specific about the device-dax subsystem.
> 
> Ah... I see the issue here.  For the systems in our hands, dax/kmem.c is
> used to online PMEM only.  Even the "soft reserved" memory is used for
> PMEM or to simulate PMEM.  So to make the code as simple as possible,
> we treat all memory devices onlined by dax/kmem as PMEM in this first
> version, and plan to support more memory types in future versions.
> 
> But from your words above, our assumption is wrong here.  dax/kmem.c
> can online HBM and other memory devices already.  If so, how do we
> distinguish between them, and how do we get the performance
> characteristics of these devices?  Can we start with SLIT?
> 

We would let low level drivers register memory_dev_types for the NUMA nodes
that will be mapped to these devices, i.e., papr_scm, ACPI NFIT or CXL drivers
can register different memory_dev_types based on the device tree, HMAT or CDAT.
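
A minimal sketch of what that driver-side registration could look like;
the helper names alloc_memory_type() and init_node_memory_type(), and
the exact flow, are assumptions for illustration rather than the
interface added by this series:

#include <linux/err.h>
#include <linux/memory-tiers.h>

static struct memory_dev_type *example_memtype;

/*
 * Assumed entry point: called once the driver knows which NUMA node
 * backs the device and has derived an abstract distance for it from
 * the device tree, HMAT or CDAT.
 */
static int example_register_node_type(int nid, int adistance)
{
	example_memtype = alloc_memory_type(adistance);
	if (IS_ERR(example_memtype))
		return PTR_ERR(example_memtype);

	/* tie the node to the type so it lands in the matching tier */
	init_node_memory_type(nid, example_memtype);
	return 0;
}
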

-aneesh

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers
  2022-08-02  9:34           ` Aneesh Kumar K V
@ 2022-08-04  0:56             ` Huang, Ying
  2022-08-04  4:49               ` Aneesh Kumar K V
  0 siblings, 1 reply; 36+ messages in thread
From: Huang, Ying @ 2022-08-04  0:56 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Dan Williams, linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> [snip]
>
> We would let low level drivers register memory_dev_types for the NUMA nodes
> that will be mapped to these devices, i.e., papr_scm, ACPI NFIT or CXL drivers
> can register different memory_dev_types based on the device tree, HMAT or CDAT.

I didn't find that ACPI NFIT can provide any performance information, just
whether the memory is non-volatile.  HMAT or CDAT should help here, but they
are not always available.  For now, what we have is just SLIT, at least on
quite a few machines.

I prefer to create the memory_dev_type in a high level driver like dax/kmem,
which may query low level sources like SLIT, HMAT, CDAT, etc. for more
information based on what is available.
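
As a rough sketch of that SLIT fallback, hypothetical in every detail
(the helper name, the linear rule, and the choice to anchor
LOCAL_DISTANCE at the default DRAM abstract distance); note that in
this series slower memory gets a smaller abstract distance, so a larger
SLIT distance scales the value down:

#include <linux/topology.h>
#include <linux/memory-tiers.h>

/* hypothetical helper: guess an abstract distance for @nid from SLIT */
static int example_slit_to_adistance(int nid)
{
	int slit = node_distance(nid, numa_node_id());

	/*
	 * LOCAL_DISTANCE (10) maps to MEMTIER_ADISTANCE_DRAM (512); a
	 * farther node gets a proportionally smaller abstract distance
	 * and hence lands in a lower tier.
	 */
	return MEMTIER_ADISTANCE_DRAM * LOCAL_DISTANCE / slit;
}
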

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers
  2022-08-04  0:56             ` Huang, Ying
@ 2022-08-04  4:49               ` Aneesh Kumar K V
  2022-08-04  5:19                 ` Huang, Ying
  0 siblings, 1 reply; 36+ messages in thread
From: Aneesh Kumar K V @ 2022-08-04  4:49 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Dan Williams, linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

On 8/4/22 6:26 AM, Huang, Ying wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> 
>> [snip]
>>
>> We would let low level drivers register memory_dev_types for the NUMA nodes
>> that will be mapped to these devices, i.e., papr_scm, ACPI NFIT or CXL drivers
>> can register different memory_dev_types based on the device tree, HMAT or CDAT.
> 
> I didn't find that ACPI NFIT can provide any performance information, just
> whether the memory is non-volatile.  HMAT or CDAT should help here, but they
> are not always available.  For now, what we have is just SLIT, at least on
> quite a few machines.
> 


The lower level driver that creates the nvdimm regions can assign a
memory type to the NUMA node which it associates with the region. For now,
drivers like papr_scm do that on ppc64. When such a driver associates a NUMA
node with nvdimm regions, it can query every detail available (the device
tree in the case of papr_scm; HMAT/SLIT or CDAT elsewhere) to map the NUMA
node to a memory type.


> I prefer to create the memory_dev_type in a high level driver like dax/kmem,
> which may query low level sources like SLIT, HMAT, CDAT, etc. for more
> information based on what is available.
> 
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers
  2022-08-04  4:49               ` Aneesh Kumar K V
@ 2022-08-04  5:19                 ` Huang, Ying
  0 siblings, 0 replies; 36+ messages in thread
From: Huang, Ying @ 2022-08-04  5:19 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Dan Williams, linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

>>> [snip]
>>>
>>> We would let low level drivers register memory_dev_types for the NUMA nodes
>>> that will be mapped to these devices, i.e., papr_scm, ACPI NFIT or CXL drivers
>>> can register different memory_dev_types based on the device tree, HMAT or CDAT.
>> 
>> I didn't find that ACPI NFIT can provide any performance information, just
>> whether the memory is non-volatile.  HMAT or CDAT should help here, but they
>> are not always available.  For now, what we have is just SLIT, at least on
>> quite a few machines.
>> 
>
>
> The lower level driver that creates the nvdimm regions can assign a
> memory type to the NUMA node which it associates with the region. For now,
> drivers like papr_scm do that on ppc64. When such a driver associates a NUMA
> node with nvdimm regions, it can query every detail available (the device
> tree in the case of papr_scm; HMAT/SLIT or CDAT elsewhere) to map the NUMA
> node to a memory type.

If we have only one information source, it's OK to create all memory
types from that source.  But if we have multiple sources, we need a
mechanism to coordinate among them.  Creating memory types in the
drivers gives us good flexibility, because each driver can use whatever
information sources it has.

Best Regards,
Huang, Ying

>> I prefer to create the memory_dev_type in a high level driver like dax/kmem,
>> which may query low level sources like SLIT, HMAT, CDAT, etc. for more
>> information based on what is available.
>> 
>> Best Regards,
>> Huang, Ying

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2022-08-04  5:19 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-28 19:04 [PATCH v11 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
2022-07-28 19:04 ` [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
2022-07-29  6:25   ` Huang, Ying
2022-07-29  7:24     ` Aneesh Kumar K.V
     [not found]   ` <62e890da7f784_577a029473@dwillia2-xfh.jf.intel.com.notmuch>
     [not found]     ` <874jyvjpw9.fsf@yhuang6-desk2.ccr.corp.intel.com>
     [not found]       ` <62e89c9addcc_62c2a29443@dwillia2-xfh.jf.intel.com.notmuch>
2022-08-02  5:03         ` Aneesh Kumar K V
     [not found]         ` <87zggni13h.fsf@yhuang6-desk2.ccr.corp.intel.com>
2022-08-02  9:34           ` Aneesh Kumar K V
2022-08-04  0:56             ` Huang, Ying
2022-08-04  4:49               ` Aneesh Kumar K V
2022-08-04  5:19                 ` Huang, Ying
2022-07-28 19:04 ` [PATCH v11 2/8] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
2022-07-28 19:04 ` [PATCH v11 3/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined Aneesh Kumar K.V
2022-07-28 19:04 ` [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM Aneesh Kumar K.V
2022-07-29  6:20   ` Huang, Ying
2022-07-29  7:19     ` Aneesh Kumar K.V
2022-08-01  2:06       ` Huang, Ying
2022-08-01  4:40         ` Aneesh Kumar K V
2022-08-01  5:10           ` Huang, Ying
2022-08-01  5:38             ` Aneesh Kumar K V
2022-08-01  6:37               ` Huang, Ying
2022-08-01  6:55                 ` Aneesh Kumar K V
2022-08-01  7:13                   ` Huang, Ying
2022-08-01  7:41                     ` Aneesh Kumar K V
2022-08-02  1:58                       ` Huang, Ying
2022-07-28 19:04 ` [PATCH v11 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
2022-07-29  6:35   ` Huang, Ying
2022-07-29  7:22     ` Aneesh Kumar K.V
2022-08-01  2:15       ` Huang, Ying
2022-07-28 19:04 ` [PATCH v11 6/8] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
2022-07-28 19:04 ` [PATCH v11 7/8] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
2022-07-28 19:04 ` [PATCH v11 8/8] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
2022-07-29  6:39   ` Huang, Ying
2022-07-29  6:41     ` Aneesh Kumar K V
2022-07-29  6:47       ` Aneesh Kumar K V
2022-08-01  1:04         ` Huang, Ying
2022-07-29  5:30 ` [PATCH v11 0/8] mm/demotion: Memory tiers and demotion Huang, Ying
2022-07-29  6:17   ` Aneesh Kumar K.V

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).