* [PATCH v9 0/8] mm/demotion: Memory tiers and demotion
@ 2022-07-14  4:53 Aneesh Kumar K.V
  2022-07-14  4:53 ` [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
                   ` (7 more replies)
  0 siblings, 8 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-14  4:53 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

The current kernel has the basic memory tiering support: Inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve the
performance.

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created during
the kernel initialization and updated when a NUMA node is hot-added or
hot-removed.  The current implementation puts all nodes with CPUs into
the top tier, and builds the tier hierarchy tier-by-tier by establishing
the per-node demotion targets based on the distances between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases:

* The current tier initialization code always initializes
  each memory-only NUMA node into a lower tier.  But a memory-only
  NUMA node may have a high performance memory device (e.g. a DRAM
  device attached via CXL.mem or a DRAM-backed memory-only node on
  a virtual machine) and should be put into a higher tier.

* The current tier hierarchy always puts CPU nodes into the top
  tier. But on a system with HBM (e.g. GPU memory) devices, these
  memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
  with CPUs are better placed in the next lower tier.

* Also because the current tier hierarchy always puts CPU nodes
  into the top tier, when a CPU is hot-added (or hot-removed) and
  turns a CPU-less memory node into a CPU node (or vice versa),
  the memory tier hierarchy changes, even though no memory node
  is added or removed.  This can make the tier hierarchy unstable
  and make it difficult to support tier-based memory accounting.

* A higher tier node can only be demoted to selected nodes on the
  next lower tier as defined by the demotion path, not any other
  node from any lower tier.  This strict, hard-coded demotion order
  does not work in all use cases (e.g. some use cases may want to
  allow cross-socket demotion to another node in the same demotion
  tier as a fallback when the preferred demotion node is out of
  space), and has resulted in the feature request for an interface to
  override the system-wide, per-node demotion order from the
  userspace.  This demotion order is also inconsistent with the page
  allocation fallback order when all the nodes in a higher tier are
  out of space: The page allocation can fall back to any node from
  any lower tier, whereas the demotion order doesn't allow that.
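
The difference described in the last item can be sketched in a few
lines of userspace C.  This is only a model under stated assumptions:
NR_NODES, the tier array and lower_tier_nodes() are hypothetical names
made up for this illustration, not kernel interfaces.  The function
computes the set of nodes that page allocation could fall back to, that
is, every node in any lower tier, as a bitmask.

```c
#include <assert.h>

#define NR_NODES 4	/* hypothetical topology size for the sketch */

/*
 * Return a bitmask of all nodes whose tier is lower than the tier of
 * 'node'.  tier[n] holds the tier ID of node n; a higher ID means a
 * higher (faster) tier.
 */
static unsigned long lower_tier_nodes(int node, const int tier[NR_NODES])
{
	unsigned long mask = 0;

	for (int n = 0; n < NR_NODES; n++)
		if (tier[n] < tier[node])
			mask |= 1UL << n;
	return mask;
}
```

A strict per-node demotion path would pick a single node out of this
set, whereas the allocation fallback may use any node in it.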

This patch series makes the creation of memory tiers explicit, under
the control of device drivers.

Memory Tier Initialization
==========================

By default, all memory nodes are assigned to the default tier with
tier ID value 200.

A device driver can move its memory nodes from the default
tier.  For example, PMEM can move its memory nodes below the
default tier, whereas GPU can move its memory nodes above the
default tier.

The kernel initialization code makes the decision on which exact tier
a memory node should be assigned to based on the requests from the
device drivers as well as the memory device hardware information
provided by the firmware.

Hot-adding/removing CPUs doesn't affect the memory tier hierarchy.
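
As a rough illustration of these initialization rules, the following
userspace sketch models them with a plain array.  It is not kernel
code: MAX_NODES, node_tier, init_default_tiers() and driver_set_tier()
are hypothetical names invented for this example; only the tier ID
values 100, 200 and 300 come from the series.

```c
#include <assert.h>
#include <stddef.h>

/* Tier ID values used by this series. */
#define TIER_PMEM	100
#define TIER_DRAM	200	/* default tier */
#define TIER_HBM_GPU	300

#define MAX_NODES	8	/* hypothetical, for the sketch only */

/* node_tier[n] is the tier ID of memory node n. */
static int node_tier[MAX_NODES];

/* Every memory node starts in the default tier. */
static void init_default_tiers(void)
{
	for (size_t n = 0; n < MAX_NODES; n++)
		node_tier[n] = TIER_DRAM;
}

/* A driver moves one of its nodes out of the default tier. */
static int driver_set_tier(int node, int tier)
{
	if (node < 0 || node >= MAX_NODES)
		return -1;
	node_tier[node] = tier;
	return 0;
}
```

In this model a PMEM driver would call driver_set_tier(node, TIER_PMEM)
and a GPU driver driver_set_tier(node, TIER_HBM_GPU); nothing in the
model depends on whether a node has CPUs.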

Changes from v8:
* Drop the sysfs interface patches and related documentation changes.

Changes from v7:
* Fix kernel crash with demotion.
* Improve documentation.

Changes from v6:
* Drop the usage of rank.
* Address other review feedback.

Changes from v5:
* Remove patch supporting N_MEMORY node removal from memory tiers. Memory tiers
  are going to be used for features other than demotion. Hence keep all N_MEMORY
  nodes in memory tiers irrespective of whether they want to participate in
  promotion or demotion.
* Add NODE_DATA->memtier
* Rearrange patches to add sysfs files later.
* Add support to create memory tiers from userspace.
* Address other review feedback.


Changes from v4:
* Address review feedback.
* Reverse the meaning of "rank": higher rank value means higher tier.
* Add "/sys/devices/system/memtier/default_tier".
* Add node_is_toptier

v4:
Add support for explicit memory tiers and ranks.

v3:
- Modify patch 1 subject to make it more specific
- Remove /sys/kernel/mm/numa/demotion_targets interface, use
  /sys/devices/system/node/demotion_targets instead and make
  it writable to override node_states[N_DEMOTION_TARGETS].
- Add support to view per node demotion targets via sysfs

v2:
In v1, only the 1st patch of this patch series was sent, which was
implemented to avoid some of the limitations on demotion
target sharing. However, for certain NUMA topologies, the demotion
targets found by that patch were not optimal, so the 1st patch
in this series is modified according to suggestions from Huang
and Baolin. Different examples of demotion list comparison
between the existing implementation and the changed implementation can
be found in the commit message of the 1st patch.



Aneesh Kumar K.V (7):
  mm/demotion: Add support for explicit memory tiers
  mm/demotion: Move memory demotion related code
  mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  mm/demotion: Add hotplug callbacks to handle new numa node onlined
  mm/demotion: Build demotion targets based on explicit memory tiers
  mm/demotion: Add pg_data_t member to track node memory tier details
  mm/demotion: Update node_is_toptier to work with memory tiers

Jagdish Gediya (1):
  mm/demotion: Demote pages according to allocation fallback order

 drivers/dax/kmem.c           |   6 +-
 include/linux/memory-tiers.h |  59 ++++
 include/linux/migrate.h      |  15 -
 include/linux/mmzone.h       |   3 +
 include/linux/node.h         |   5 -
 mm/Makefile                  |   1 +
 mm/huge_memory.c             |   1 +
 mm/memory-tiers.c            | 653 +++++++++++++++++++++++++++++++++++
 mm/migrate.c                 | 453 +-----------------------
 mm/mprotect.c                |   1 +
 mm/vmscan.c                  |  59 +++-
 mm/vmstat.c                  |   4 -
 12 files changed, 768 insertions(+), 492 deletions(-)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

-- 
2.36.1



* [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-14  4:53 [PATCH v9 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
@ 2022-07-14  4:53 ` Aneesh Kumar K.V
  2022-07-15  7:53   ` Huang, Ying
  2022-07-14  4:53 ` [PATCH v9 2/8] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-14  4:53 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V, Jagdish Gediya

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created
during the kernel initialization and updated when a NUMA node is
hot-added or hot-removed.  The current implementation puts all
nodes with CPUs into the top tier, and builds the tier hierarchy
tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases:

The current tier initialization code always initializes
each memory-only NUMA node into a lower tier.  But a memory-only
NUMA node may have a high performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top
tier. But on a system with HBM or GPU devices, the
memory-only NUMA nodes mapping these devices should be in the
top tier, and DRAM nodes with CPUs are better to be placed into the
next lower tier.

With the current kernel, a higher tier node can only be demoted to
selected nodes on the next lower tier as defined by the demotion path,
not any other
node from any lower tier.  This strict, hard-coded demotion order
does not work in all use cases (e.g. some use cases may want to
allow cross-socket demotion to another node in the same demotion
tier as a fallback when the preferred demotion node is out of
space).  This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are
out of space: The page allocation can fall back to any node from
any lower tier, whereas the demotion order doesn't allow that.

The current kernel also doesn't provide any interfaces for the
userspace to learn about the memory tier hierarchy in order to
optimize its memory allocations.

This patch series addresses the above by defining memory tiers explicitly.

This patch introduces explicit memory tiers. The tier ID value
of a memory tier is used to derive the demotion order between
NUMA nodes.

For example, if we have 3 memtiers: memtier100, memtier200, memtier300
then the memory tier order is: memtier300 -> memtier200 -> memtier100
where memtier300 is the highest tier and memtier100 is the lowest tier.
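
The ordering rule can be sketched in userspace C as follows.  This is
an illustrative model only, assuming a small fixed-size array instead
of the kernel's linked list; MAX_TIERS, tiers, insert_tier() and
next_lower_tier() are hypothetical names for this example.  Tier IDs
are kept sorted from highest to lowest, so the entry that follows a
tier is the next lower tier.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TIERS 8	/* hypothetical capacity for the sketch */

static int tiers[MAX_TIERS];	/* tier IDs, highest tier first */
static size_t nr_tiers;

/* Insert a tier ID, keeping the array sorted from highest to lowest. */
static int insert_tier(int id)
{
	size_t pos, i;

	if (nr_tiers == MAX_TIERS)
		return -1;

	/* The new tier goes before the first entry with a smaller ID. */
	for (pos = 0; pos < nr_tiers; pos++)
		if (tiers[pos] < id)
			break;

	/* Shift the lower tiers down by one slot to make room. */
	for (i = nr_tiers; i > pos; i--)
		tiers[i] = tiers[i - 1];
	tiers[pos] = id;
	nr_tiers++;
	return 0;
}

/* Return the next lower tier ID, or -1 if 'id' is lowest or unknown. */
static int next_lower_tier(int id)
{
	for (size_t i = 0; i + 1 < nr_tiers; i++)
		if (tiers[i] == id)
			return tiers[i + 1];
	return -1;
}
```

Whatever order the tiers are registered in, the resulting sequence is
300, 200, 100, which matches the demotion direction described above.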

During reclaim, we migrate pages from fast (higher) tiers to slow (lower)
tiers when the fast (higher) tier is under memory pressure.

This patchset introduces 3 memory tiers (memtier100, memtier200 and memtier300)
which are created by different kernel subsystems. The default memory
tier created by the kernel is memtier200. A kernel parameter is provided
to override the default memory tier.

Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h | 15 +++++++
 mm/Makefile                  |  1 +
 mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
 3 files changed, 94 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..a81dbc20e0d1
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+#ifdef CONFIG_NUMA
+
+#define MEMORY_TIER_HBM_GPU	300
+#define MEMORY_TIER_DRAM	200
+#define MEMORY_TIER_PMEM	100
+
+#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
+#define MAX_MEMORY_TIER_ID	400
+
+#endif	/* CONFIG_NUMA */
+#endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..d30acebc2164 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..011877b6dbb9
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/lockdep.h>
+#include <linux/moduleparam.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+	struct list_head list;
+	int id;
+	nodemask_t nodelist;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+
+static void insert_memory_tier(struct memory_tier *memtier)
+{
+	struct list_head *ent;
+	struct memory_tier *tmp_memtier;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	list_for_each(ent, &memory_tiers) {
+		tmp_memtier = list_entry(ent, struct memory_tier, list);
+		if (tmp_memtier->id < memtier->id) {
+			list_add_tail(&memtier->list, ent);
+			return;
+		}
+	}
+	list_add_tail(&memtier->list, &memory_tiers);
+}
+
+static struct memory_tier *register_memory_tier(unsigned int tier)
+{
+	struct memory_tier *memtier;
+
+	if (tier > MAX_MEMORY_TIER_ID)
+		return ERR_PTR(-EINVAL);
+
+	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!memtier)
+		return ERR_PTR(-ENOMEM);
+
+	memtier->id   = tier;
+
+	insert_memory_tier(memtier);
+
+	return memtier;
+}
+
+static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
+core_param(default_memory_tier, default_memtier, uint, 0644);
+
+static int __init memory_tier_init(void)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * Register only default memory tier to hide all empty
+	 * memory tier from sysfs. Since this is early during
+	 * boot, we could avoid holding memory_tier_lock. But
+	 * keep it simple by holding locks. So we can add lock
+	 * held debug checks in other functions.
+	 */
+	mutex_lock(&memory_tier_lock);
+	memtier = register_memory_tier(default_memtier);
+	if (IS_ERR(memtier))
+		panic("%s() failed to register memory tier: %ld\n",
+		      __func__, PTR_ERR(memtier));
+
+	/* CPU only nodes are not part of memory tiers. */
+	memtier->nodelist = node_states[N_MEMORY];
+	mutex_unlock(&memory_tier_lock);
+	return 0;
+}
+subsys_initcall(memory_tier_init);
-- 
2.36.1



* [PATCH v9 2/8] mm/demotion: Move memory demotion related code
  2022-07-14  4:53 [PATCH v9 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
  2022-07-14  4:53 ` [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
@ 2022-07-14  4:53 ` Aneesh Kumar K.V
  2022-07-14  4:53 ` [PATCH v9 3/8] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-14  4:53 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

This moves memory demotion related code to mm/memory-tiers.c.
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  7 ++++
 include/linux/migrate.h      |  2 --
 mm/memory-tiers.c            | 63 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 | 60 +---------------------------------
 mm/vmscan.c                  |  1 +
 5 files changed, 72 insertions(+), 61 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index a81dbc20e0d1..c47dbe381089 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_MEMORY_TIERS_H
 #define _LINUX_MEMORY_TIERS_H
 
+#include <linux/types.h>
+
 #ifdef CONFIG_NUMA
 
 #define MEMORY_TIER_HBM_GPU	300
@@ -11,5 +13,10 @@
 #define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
 #define MAX_MEMORY_TIER_ID	400
 
+extern bool numa_demotion_enabled;
+
+#else
+
+#define numa_demotion_enabled	false
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 069a89e847f3..43e737215f33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -78,7 +78,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
 extern void set_migration_target_nodes(void);
 extern void migrate_on_reclaim_init(void);
-extern bool numa_demotion_enabled;
 extern int next_demotion_node(int node);
 #else
 static inline void set_migration_target_nodes(void) {}
@@ -87,7 +86,6 @@ static inline int next_demotion_node(int node)
 {
         return NUMA_NO_NODE;
 }
-#define numa_demotion_enabled  false
 #endif
 
 #ifdef CONFIG_COMPACTION
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 011877b6dbb9..5cb7a351594b 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/types.h>
+#include <linux/device.h>
 #include <linux/nodemask.h>
 #include <linux/slab.h>
 #include <linux/lockdep.h>
@@ -76,3 +77,65 @@ static int __init memory_tier_init(void)
 	return 0;
 }
 subsys_initcall(memory_tier_init);
+
+bool numa_demotion_enabled = false;
+
+#ifdef CONFIG_MIGRATION
+#ifdef CONFIG_SYSFS
+static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  numa_demotion_enabled ? "true" : "false");
+}
+
+static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = kstrtobool(buf, &numa_demotion_enabled);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute numa_demotion_enabled_attr =
+	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
+	       numa_demotion_enabled_store);
+
+static struct attribute *numa_attrs[] = {
+	&numa_demotion_enabled_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group numa_attr_group = {
+	.attrs = numa_attrs,
+};
+
+static int __init numa_init_sysfs(void)
+{
+	int err;
+	struct kobject *numa_kobj;
+
+	numa_kobj = kobject_create_and_add("numa", mm_kobj);
+	if (!numa_kobj) {
+		pr_err("failed to create numa kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(numa_kobj, &numa_attr_group);
+	if (err) {
+		pr_err("failed to register numa group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(numa_kobj);
+	return err;
+}
+subsys_initcall(numa_init_sysfs);
+#endif /* CONFIG_SYSFS */
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c1ea61f39d8..fce7d4a9e940 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2509,64 +2509,6 @@ void __init migrate_on_reclaim_init(void)
 	set_migration_target_nodes();
 	cpus_read_unlock();
 }
+#endif /* CONFIG_NUMA */
 
-bool numa_demotion_enabled = false;
-
-#ifdef CONFIG_SYSFS
-static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
-					  struct kobj_attribute *attr, char *buf)
-{
-	return sysfs_emit(buf, "%s\n",
-			  numa_demotion_enabled ? "true" : "false");
-}
-
-static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
-					   struct kobj_attribute *attr,
-					   const char *buf, size_t count)
-{
-	ssize_t ret;
-
-	ret = kstrtobool(buf, &numa_demotion_enabled);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static struct kobj_attribute numa_demotion_enabled_attr =
-	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
-	       numa_demotion_enabled_store);
-
-static struct attribute *numa_attrs[] = {
-	&numa_demotion_enabled_attr.attr,
-	NULL,
-};
-
-static const struct attribute_group numa_attr_group = {
-	.attrs = numa_attrs,
-};
-
-static int __init numa_init_sysfs(void)
-{
-	int err;
-	struct kobject *numa_kobj;
 
-	numa_kobj = kobject_create_and_add("numa", mm_kobj);
-	if (!numa_kobj) {
-		pr_err("failed to create numa kobject\n");
-		return -ENOMEM;
-	}
-	err = sysfs_create_group(numa_kobj, &numa_attr_group);
-	if (err) {
-		pr_err("failed to register numa group\n");
-		goto delete_obj;
-	}
-	return 0;
-
-delete_obj:
-	kobject_put(numa_kobj);
-	return err;
-}
-subsys_initcall(numa_init_sysfs);
-#endif /* CONFIG_SYSFS */
-#endif /* CONFIG_NUMA */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7d9a683e3a7..3a8f78277f99 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,7 @@
 #include <linux/printk.h>
 #include <linux/dax.h>
 #include <linux/psi.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
-- 
2.36.1



* [PATCH v9 3/8] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  2022-07-14  4:53 [PATCH v9 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
  2022-07-14  4:53 ` [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
  2022-07-14  4:53 ` [PATCH v9 2/8] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
@ 2022-07-14  4:53 ` Aneesh Kumar K.V
  2022-07-14  4:53 ` [PATCH v9 4/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-14  4:53 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V, Jagdish Gediya

By default, all nodes are assigned to DEFAULT_MEMORY_TIER, which
is the memory tier designated for nodes with DRAM.

Set dax kmem device node's tier to MEMORY_TIER_PMEM. MEMORY_TIER_PMEM
appears below DEFAULT_MEMORY_TIER in demotion order.
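
The bookkeeping involved in moving a node between tiers can be modelled
in userspace C as below.  This is a simplified sketch, not the code in
this patch: struct tier, the fixed-size tiers table, find_tier(),
tier_of_node() and set_node_tier() are hypothetical names, and a plain
bitmask stands in for nodemask_t.  A node belongs to at most one tier,
the destination tier is created on demand, and a tier that loses its
last node is dropped.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TIERS 4	/* hypothetical capacity for the sketch */
#define MAX_NODES 32	/* one bit per node in an unsigned long */

struct tier {
	int id;			/* 0 means the slot is unused */
	unsigned long nodes;	/* bit n set: node n is in this tier */
};

static struct tier tiers[MAX_TIERS];

/* Look up a tier by ID; find_tier(0) returns a free slot, if any. */
static struct tier *find_tier(int id)
{
	for (size_t i = 0; i < MAX_TIERS; i++)
		if (tiers[i].id == id)
			return &tiers[i];
	return NULL;
}

/* Return the tier that currently contains 'node', or NULL. */
static struct tier *tier_of_node(int node)
{
	for (size_t i = 0; i < MAX_TIERS; i++)
		if (tiers[i].id && (tiers[i].nodes & (1UL << node)))
			return &tiers[i];
	return NULL;
}

/* Move 'node' into tier 'id', creating that tier if it is missing. */
static int set_node_tier(int node, int id)
{
	struct tier *src, *dst;

	if (node < 0 || node >= MAX_NODES || id <= 0)
		return -1;

	src = tier_of_node(node);
	dst = find_tier(id);
	if (!dst) {
		dst = find_tier(0);	/* grab a free slot */
		if (!dst)
			return -1;	/* no room for another tier */
		dst->id = id;
		dst->nodes = 0;
	}
	if (src == dst)
		return 0;		/* already in the requested tier */

	if (src) {
		src->nodes &= ~(1UL << node);
		if (!src->nodes)
			src->id = 0;	/* drop the now-empty tier */
	}
	dst->nodes |= 1UL << node;
	return 0;
}
```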

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/dax/kmem.c           |  6 ++-
 include/linux/memory-tiers.h |  5 +++
 mm/memory-tiers.c            | 79 ++++++++++++++++++++++++++++++++++++
 3 files changed, 89 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..0c03889286ac 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -11,6 +11,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/memory-tiers.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -41,6 +42,9 @@ struct dax_kmem_data {
 	struct resource *res[];
 };
 
+static unsigned int dax_kmem_memtier = MEMORY_TIER_PMEM;
+module_param(dax_kmem_memtier, uint, 0644);
+
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
@@ -146,7 +150,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	}
 
 	dev_set_drvdata(dev, data);
-
+	node_create_and_set_memory_tier(numa_node, dax_kmem_memtier);
 	return 0;
 
 err_request_mem:
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index c47dbe381089..9d36ff13c954 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -14,9 +14,14 @@
 #define MAX_MEMORY_TIER_ID	400
 
 extern bool numa_demotion_enabled;
+int node_create_and_set_memory_tier(int node, int tier);
 
 #else
 
 #define numa_demotion_enabled	false
+static inline int node_create_and_set_memory_tier(int node, int tier)
+{
+	return 0;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 5cb7a351594b..79347d4ab05e 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -51,6 +51,85 @@ static struct memory_tier *register_memory_tier(unsigned int tier)
 	return memtier;
 }
 
+static void unregister_memory_tier(struct memory_tier *memtier)
+{
+	list_del(&memtier->list);
+	kfree(memtier);
+}
+
+static struct memory_tier *__node_get_memory_tier(int node)
+{
+	struct memory_tier *memtier;
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (node_isset(node, memtier->nodelist))
+			return memtier;
+	}
+	return NULL;
+}
+
+static struct memory_tier *__get_memory_tier_from_id(int id)
+{
+	struct memory_tier *memtier;
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (memtier->id == id)
+			return memtier;
+	}
+	return NULL;
+}
+
+static int __node_create_and_set_memory_tier(int node, int tier)
+{
+	int ret = 0;
+	struct memory_tier *memtier;
+
+	memtier = __get_memory_tier_from_id(tier);
+	if (!memtier) {
+		memtier = register_memory_tier(tier);
+		if (IS_ERR(memtier)) {
+			ret = PTR_ERR(memtier);
+			goto out;
+		}
+	}
+	node_set(node, memtier->nodelist);
+out:
+	return ret;
+}
+
+int node_create_and_set_memory_tier(int node, int tier)
+{
+	struct memory_tier *current_tier;
+	int ret = 0;
+
+	mutex_lock(&memory_tier_lock);
+
+	current_tier = __node_get_memory_tier(node);
+	if (!current_tier) {
+		ret = __node_create_and_set_memory_tier(node, tier);
+		goto out;
+	}
+
+	if (current_tier->id == tier)
+		goto out;
+
+	node_clear(node, current_tier->nodelist);
+
+	ret = __node_create_and_set_memory_tier(node, tier);
+	if (ret) {
+		/* reset it back to older tier */
+		node_set(node, current_tier->nodelist);
+		goto out;
+	}
+	if (nodes_empty(current_tier->nodelist))
+		unregister_memory_tier(current_tier);
+out:
+	mutex_unlock(&memory_tier_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(node_create_and_set_memory_tier);
+
 static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
 core_param(default_memory_tier, default_memtier, uint, 0644);
 
-- 
2.36.1



* [PATCH v9 4/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined
  2022-07-14  4:53 [PATCH v9 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2022-07-14  4:53 ` [PATCH v9 3/8] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
@ 2022-07-14  4:53 ` Aneesh Kumar K.V
  2022-07-15  4:38   ` Alistair Popple
  2022-07-14  4:53 ` [PATCH v9 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-14  4:53 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

If a newly onlined NUMA node doesn't have a memory tier assigned,
the kernel adds the NUMA node to the default memory tier.
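
The rule can be modelled in a few lines of userspace C.  This is only a
sketch under stated assumptions: MAX_NODES, NO_TIER, node_tier and
node_online() are hypothetical names for this illustration, and a plain
array replaces the kernel's tier list.  A tier chosen earlier (for
example by a driver) is kept; only a node without a tier gets the
default one.

```c
#include <assert.h>

#define MAX_NODES	8	/* hypothetical, for the sketch only */
#define DEFAULT_TIER	200
#define NO_TIER		0

/* node_tier[n] is the tier ID of node n; zero-initialized to NO_TIER. */
static int node_tier[MAX_NODES];

/*
 * Model of the online callback: assign the default tier only when the
 * node has none.  Returns the tier the node ends up in, or -1 for an
 * invalid node.
 */
static int node_online(int node)
{
	if (node < 0 || node >= MAX_NODES)
		return -1;
	if (node_tier[node] == NO_TIER)
		node_tier[node] = DEFAULT_TIER;
	return node_tier[node];
}
```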

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/memory-tiers.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 79347d4ab05e..5706ad647136 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -5,6 +5,7 @@
 #include <linux/slab.h>
 #include <linux/lockdep.h>
 #include <linux/moduleparam.h>
+#include <linux/memory.h>
 #include <linux/memory-tiers.h>
 
 struct memory_tier {
@@ -130,8 +131,73 @@ int node_create_and_set_memory_tier(int node, int tier)
 }
 EXPORT_SYMBOL_GPL(node_create_and_set_memory_tier);
 
+static int __node_set_memory_tier(int node, int tier)
+{
+	int ret = 0;
+	struct memory_tier *memtier;
+
+	memtier = __get_memory_tier_from_id(tier);
+	if (!memtier) {
+		ret = -EINVAL;
+		goto out;
+	}
+	node_set(node, memtier->nodelist);
+out:
+	return ret;
+}
+
+static int node_set_memory_tier(int node, int tier)
+{
+	struct memory_tier *memtier;
+	int ret = 0;
+
+	mutex_lock(&memory_tier_lock);
+	memtier = __node_get_memory_tier(node);
+	if (!memtier)
+		ret = __node_set_memory_tier(node, tier);
+
+	mutex_unlock(&memory_tier_lock);
+
+	return ret;
+}
+
 static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
 core_param(default_memory_tier, default_memtier, uint, 0644);
+/*
+ * This runs whether reclaim-based migration is enabled or not,
+ * which ensures that the user can turn reclaim-based migration
+ * on at any time without needing to recalculate migration targets.
+ */
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+						 unsigned long action, void *_arg)
+{
+	struct memory_notify *arg = _arg;
+
+	/*
+	 * Only update the node migration order when a node is
+	 * changing status, like online->offline.
+	 */
+	if (arg->status_change_nid < 0)
+		return notifier_from_errno(0);
+
+	switch (action) {
+	case MEM_ONLINE:
+		/*
+		 * We ignore the error here, if the node already has the tier
+		 * registered, we will continue to use that for the new memory
+		 * we are adding here.
+		 */
+		node_set_memory_tier(arg->status_change_nid, default_memtier);
+		break;
+	}
+
+	return notifier_from_errno(0);
+}
+
+static void __init migrate_on_reclaim_init(void)
+{
+	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
+}
 
 static int __init memory_tier_init(void)
 {
@@ -153,6 +219,8 @@ static int __init memory_tier_init(void)
 	/* CPU only nodes are not part of memory tiers. */
 	memtier->nodelist = node_states[N_MEMORY];
 	mutex_unlock(&memory_tier_lock);
+
+	migrate_on_reclaim_init();
 	return 0;
 }
 subsys_initcall(memory_tier_init);
-- 
2.36.1



* [PATCH v9 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-14  4:53 [PATCH v9 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2022-07-14  4:53 ` [PATCH v9 4/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined Aneesh Kumar K.V
@ 2022-07-14  4:53 ` Aneesh Kumar K.V
  2022-07-15  4:47   ` Alistair Popple
  2022-07-14  4:53 ` [PATCH v9 6/8] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-14  4:53 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

This patch switches the demotion target building logic to use memory tiers
instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
default tier 200 and additional memory tiers will be added by drivers like
dax kmem.

This patch builds the demotion target for a NUMA node by looking at all
memory tiers below the tier to which the NUMA node belongs. The closest node
in the immediately following memory tier is used as a demotion target.

Since we are now only building demotion targets for N_MEMORY NUMA nodes,
the CPU hotplug calls are removed in this patch.

A new memory tier can be inserted into the tier hierarchy for a new set
of nodes without affecting the node assignment of any existing memtier,
provided that there is enough gap in the tier ID values for the new memtier.

The absolute value of tier ID of a memtier doesn't necessarily carry any meaning.
Its value relative to other memtiers decides the level of this memtier in the tier
hierarchy.

For now, this patch supports hardcoded tier ID values, which are 300, 200 and 100,
for memory tiers.
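
The target selection described above can be sketched in userspace C as
follows.  It is a simplified model, not the code in this patch:
NR_NODES, NO_NODE and pick_demotion_target() are hypothetical names,
the tier and distance tables are passed in by the caller, and ties
between equally distant nodes are resolved by taking the lowest node
number, which is an assumption of this sketch.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4	/* hypothetical topology size for the sketch */
#define NO_NODE	 (-1)

/*
 * Pick a demotion target for 'node': the nearest node in the closest
 * tier below the node's own tier.  tier[n] is the tier ID of node n
 * and dist[a][b] is the NUMA distance from node a to node b.
 */
static int pick_demotion_target(int node, const int tier[NR_NODES],
				int dist[NR_NODES][NR_NODES])
{
	int next_tier = INT_MIN;
	int best = NO_NODE;
	int best_dist = INT_MAX;

	/* Find the highest tier ID that is still below this node's tier. */
	for (int n = 0; n < NR_NODES; n++)
		if (tier[n] < tier[node] && tier[n] > next_tier)
			next_tier = tier[n];

	if (next_tier == INT_MIN)
		return NO_NODE;		/* already in the lowest tier */

	/* Among the nodes of that tier, take the nearest one. */
	for (int n = 0; n < NR_NODES; n++) {
		if (tier[n] == next_tier && dist[node][n] < best_dist) {
			best_dist = dist[node][n];
			best = n;
		}
	}
	return best;
}
```

Because only the relative order of tier IDs is compared, a tier with ID
150 could later be added between 200 and 100 without changing the tier
of any existing node.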

Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  13 ++
 include/linux/migrate.h      |  13 --
 mm/memory-tiers.c            | 227 ++++++++++++++++++++
 mm/migrate.c                 | 394 -----------------------------------
 mm/vmstat.c                  |   4 -
 5 files changed, 240 insertions(+), 411 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 9d36ff13c954..3234301c2537 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -15,6 +15,14 @@
 
 extern bool numa_demotion_enabled;
 int node_create_and_set_memory_tier(int node, int tier);
+#ifdef CONFIG_MIGRATION
+int next_demotion_node(int node);
+#else
+static inline int next_demotion_node(int node)
+{
+	return NUMA_NO_NODE;
+}
+#endif
 
 #else
 
@@ -23,5 +31,10 @@ static inline int node_create_and_set_memory_tier(int node, int tier)
 {
 	return 0;
 }
+
+static inline int next_demotion_node(int node)
+{
+	return NUMA_NO_NODE;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 43e737215f33..93fab62e6548 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 
 #endif /* CONFIG_MIGRATION */
 
-#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
-extern void set_migration_target_nodes(void);
-extern void migrate_on_reclaim_init(void);
-extern int next_demotion_node(int node);
-#else
-static inline void set_migration_target_nodes(void) {}
-static inline void migrate_on_reclaim_init(void) {}
-static inline int next_demotion_node(int node)
-{
-        return NUMA_NO_NODE;
-}
-#endif
-
 #ifdef CONFIG_COMPACTION
 extern int PageMovable(struct page *page);
 extern void __SetPageMovable(struct page *page, struct address_space *mapping);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 5706ad647136..e951f54ce56c 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -6,16 +6,85 @@
 #include <linux/lockdep.h>
 #include <linux/moduleparam.h>
 #include <linux/memory.h>
+#include <linux/random.h>
 #include <linux/memory-tiers.h>
 
+#include "internal.h"
+
 struct memory_tier {
 	struct list_head list;
 	int id;
 	nodemask_t nodelist;
 };
 
+struct demotion_nodes {
+	nodemask_t preferred;
+};
+
+static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+/*
+ * node_demotion[] examples:
+ *
+ * Example 1:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
+ *
+ * node distances:
+ * node   0    1    2    3
+ *    0  10   20   30   40
+ *    1  20   10   40   30
+ *    2  30   40   10   40
+ *    3  40   30   40   10
+ *
+ * memory_tiers[0] = <empty>
+ * memory_tiers[1] = 0-1
+ * memory_tiers[2] = 2-3
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 3
+ * node_demotion[2].preferred = <empty>
+ * node_demotion[3].preferred = <empty>
+ *
+ * Example 2:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   30
+ *    2  30   30   10
+ *
+ * memory_tiers[0] = <empty>
+ * memory_tiers[1] = 0-2
+ * memory_tiers[2] = <empty>
+ *
+ * node_demotion[0].preferred = <empty>
+ * node_demotion[1].preferred = <empty>
+ * node_demotion[2].preferred = <empty>
+ *
+ * Example 3:
+ *
+ * Node 0 is a CPU + DRAM node, node 1 is an HBM node, node 2 is a PMEM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   40
+ *    2  30   40   10
+ *
+ * memory_tiers[0] = 1
+ * memory_tiers[1] = 0
+ * memory_tiers[2] = 2
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 0
+ * node_demotion[2].preferred = <empty>
+ *
+ */
+static struct demotion_nodes *node_demotion __read_mostly;
 
 static void insert_memory_tier(struct memory_tier *memtier)
 {
@@ -108,6 +177,7 @@ int node_create_and_set_memory_tier(int node, int tier)
 	current_tier = __node_get_memory_tier(node);
 	if (!current_tier) {
 		ret = __node_create_and_set_memory_tier(node, tier);
+		establish_migration_targets();
 		goto out;
 	}
 
@@ -124,6 +194,8 @@ int node_create_and_set_memory_tier(int node, int tier)
 	}
 	if (nodes_empty(current_tier->nodelist))
 		unregister_memory_tier(current_tier);
+
+	establish_migration_targets();
 out:
 	mutex_unlock(&memory_tier_lock);
 
@@ -153,14 +225,152 @@ static int node_set_memory_tier(int node, int tier)
 
 	mutex_lock(&memory_tier_lock);
 	memtier = __node_get_memory_tier(node);
+	/*
+	 * If the node is already part of a tier, proceed with the
+	 * current tier value, because we might want to establish
+	 * new migration paths now. The node might have been added to a
+	 * tier before it was made part of N_MEMORY, in which case
+	 * establish_migration_targets() will have skipped this node.
+	 */
 	if (!memtier)
 		ret = __node_set_memory_tier(node, tier);
+	establish_migration_targets();
 
 	mutex_unlock(&memory_tier_lock);
 
 	return ret;
 }
 
+#ifdef CONFIG_MIGRATION
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * Return: node id for next memory node in the demotion path hierarchy
+ * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
+ * @node online or guarantee that it *continues* to be the next demotion
+ * target.
+ */
+int next_demotion_node(int node)
+{
+	struct demotion_nodes *nd;
+	int target;
+
+	if (!node_demotion)
+		return NUMA_NO_NODE;
+
+	nd = &node_demotion[node];
+
+	/*
+	 * node_demotion[] is updated without excluding this
+	 * function from running.
+	 *
+	 * Make sure to use RCU over entire code blocks if
+	 * node_demotion[] reads need to be consistent.
+	 */
+	rcu_read_lock();
+	/*
+	 * If there are multiple target nodes, just select one
+	 * target node randomly.
+	 *
+	 * In addition, we can also use round-robin to select
+	 * target node, but we should introduce another variable
+	 * for node_demotion[] to record last selected target node,
+	 * that may cause cache ping-pong due to the changing of
+	 * last target node. Or introducing per-cpu data to avoid
+	 * caching issue, which seems more complicated. So selecting
+	 * target node randomly seems better until now.
+	 */
+	target = node_random(&nd->preferred);
+	rcu_read_unlock();
+
+	return target;
+}
+
+/* Disable reclaim-based migration. */
+static void __disable_all_migrate_targets(void)
+{
+	int node;
+
+	for_each_node_state(node, N_MEMORY)
+		node_demotion[node].preferred = NODE_MASK_NONE;
+}
+
+static void disable_all_migrate_targets(void)
+{
+	__disable_all_migrate_targets();
+
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 */
+	synchronize_rcu();
+}
+#else
+static void disable_all_migrate_targets(void) {}
+#endif
+
+/*
+ * Find an automatic demotion target for all memory
+ * nodes. Failing here is OK.  It might just indicate
+ * being at the end of a chain.
+ */
+static void establish_migration_targets(void)
+{
+	struct memory_tier *memtier;
+	struct demotion_nodes *nd;
+	int target = NUMA_NO_NODE, node;
+	int distance, best_distance;
+	nodemask_t used;
+
+	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
+		return;
+
+	disable_all_migrate_targets();
+
+	for_each_node_state(node, N_MEMORY) {
+		best_distance = -1;
+		nd = &node_demotion[node];
+
+		memtier = __node_get_memory_tier(node);
+		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
+			continue;
+		/*
+		 * Get the next memtier to find the demotion node list.
+		 */
+		memtier = list_next_entry(memtier, list);
+
+		/*
+		 * find_next_best_node, use 'used' nodemask as a skip list.
+		 * Add all memory nodes except the selected memory tier
+		 * nodelist to skip list so that we find the best node from the
+		 * memtier nodelist.
+		 */
+		nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);
+
+		/*
+		 * Find all the nodes in the memory tier node list with the same best
+		 * distance and add them to the preferred mask. We randomly select
+		 * between nodes in the preferred mask when allocating pages during demotion.
+		 */
+		do {
+			target = find_next_best_node(node, &used);
+			if (target == NUMA_NO_NODE)
+				break;
+
+			distance = node_distance(node, target);
+			if (distance == best_distance || best_distance == -1) {
+				best_distance = distance;
+				node_set(target, nd->preferred);
+			} else {
+				break;
+			}
+		} while (1);
+	}
+}
+
 static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
 core_param(default_memory_tier, default_memtier, uint, 0644);
 /*
@@ -181,6 +391,17 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
 		return notifier_from_errno(0);
 
 	switch (action) {
+	case MEM_OFFLINE:
+		/*
+		 * In case we are moving out of N_MEMORY. Keep the node
+		 * in the memory tier so that when we bring memory online,
+		 * they appear in the right memory tier. We still need
+		 * to rebuild the demotion order.
+		 */
+		mutex_lock(&memory_tier_lock);
+		establish_migration_targets();
+		mutex_unlock(&memory_tier_lock);
+		break;
 	case MEM_ONLINE:
 		/*
 		 * We ignore the error here, if the node already have the tier
@@ -196,6 +417,12 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
 
 static void __init migrate_on_reclaim_init(void)
 {
+
+	if (IS_ENABLED(CONFIG_MIGRATION)) {
+		node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
+					GFP_KERNEL);
+		WARN_ON(!node_demotion);
+	}
 	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index fce7d4a9e940..c758c9c21d7d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2117,398 +2117,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	return 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
-
-/*
- * node_demotion[] example:
- *
- * Consider a system with two sockets.  Each socket has
- * three classes of memory attached: fast, medium and slow.
- * Each memory class is placed in its own NUMA node.  The
- * CPUs are placed in the node with the "fast" memory.  The
- * 6 NUMA nodes (0-5) might be split among the sockets like
- * this:
- *
- *	Socket A: 0, 1, 2
- *	Socket B: 3, 4, 5
- *
- * When Node 0 fills up, its memory should be migrated to
- * Node 1.  When Node 1 fills up, it should be migrated to
- * Node 2.  The migration path start on the nodes with the
- * processors (since allocations default to this node) and
- * fast memory, progress through medium and end with the
- * slow memory:
- *
- *	0 -> 1 -> 2 -> stop
- *	3 -> 4 -> 5 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *
- *	{  nr=1, nodes[0]=1 }, // Node 0 migrates to 1
- *	{  nr=1, nodes[0]=2 }, // Node 1 migrates to 2
- *	{  nr=0, nodes[0]=-1 }, // Node 2 does not migrate
- *	{  nr=1, nodes[0]=4 }, // Node 3 migrates to 4
- *	{  nr=1, nodes[0]=5 }, // Node 4 migrates to 5
- *	{  nr=0, nodes[0]=-1 }, // Node 5 does not migrate
- *
- * Moreover some systems may have multiple slow memory nodes.
- * Suppose a system has one socket with 3 memory nodes, node 0
- * is fast memory type, and node 1/2 both are slow memory
- * type, and the distance between fast memory node and slow
- * memory node is same. So the migration path should be:
- *
- *	0 -> 1/2 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
- *	{ nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
- *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
- */
-
-/*
- * Writes to this array occur without locking.  Cycles are
- * not allowed: Node X demotes to Y which demotes to X...
- *
- * If multiple reads are performed, a single rcu_read_lock()
- * must be held over all reads to ensure that no cycles are
- * observed.
- */
-#define DEFAULT_DEMOTION_TARGET_NODES 15
-
-#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
-#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
-#else
-#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
-#endif
-
-struct demotion_nodes {
-	unsigned short nr;
-	short nodes[DEMOTION_TARGET_NODES];
-};
-
-static struct demotion_nodes *node_demotion __read_mostly;
-
-/**
- * next_demotion_node() - Get the next node in the demotion path
- * @node: The starting node to lookup the next node
- *
- * Return: node id for next memory node in the demotion path hierarchy
- * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
- * @node online or guarantee that it *continues* to be the next demotion
- * target.
- */
-int next_demotion_node(int node)
-{
-	struct demotion_nodes *nd;
-	unsigned short target_nr, index;
-	int target;
-
-	if (!node_demotion)
-		return NUMA_NO_NODE;
-
-	nd = &node_demotion[node];
-
-	/*
-	 * node_demotion[] is updated without excluding this
-	 * function from running.  RCU doesn't provide any
-	 * compiler barriers, so the READ_ONCE() is required
-	 * to avoid compiler reordering or read merging.
-	 *
-	 * Make sure to use RCU over entire code blocks if
-	 * node_demotion[] reads need to be consistent.
-	 */
-	rcu_read_lock();
-	target_nr = READ_ONCE(nd->nr);
-
-	switch (target_nr) {
-	case 0:
-		target = NUMA_NO_NODE;
-		goto out;
-	case 1:
-		index = 0;
-		break;
-	default:
-		/*
-		 * If there are multiple target nodes, just select one
-		 * target node randomly.
-		 *
-		 * In addition, we can also use round-robin to select
-		 * target node, but we should introduce another variable
-		 * for node_demotion[] to record last selected target node,
-		 * that may cause cache ping-pong due to the changing of
-		 * last target node. Or introducing per-cpu data to avoid
-		 * caching issue, which seems more complicated. So selecting
-		 * target node randomly seems better until now.
-		 */
-		index = get_random_int() % target_nr;
-		break;
-	}
-
-	target = READ_ONCE(nd->nodes[index]);
-
-out:
-	rcu_read_unlock();
-	return target;
-}
-
-/* Disable reclaim-based migration. */
-static void __disable_all_migrate_targets(void)
-{
-	int node, i;
-
-	if (!node_demotion)
-		return;
-
-	for_each_online_node(node) {
-		node_demotion[node].nr = 0;
-		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
-			node_demotion[node].nodes[i] = NUMA_NO_NODE;
-	}
-}
-
-static void disable_all_migrate_targets(void)
-{
-	__disable_all_migrate_targets();
-
-	/*
-	 * Ensure that the "disable" is visible across the system.
-	 * Readers will see either a combination of before+disable
-	 * state or disable+after.  They will never see before and
-	 * after state together.
-	 *
-	 * The before+after state together might have cycles and
-	 * could cause readers to do things like loop until this
-	 * function finishes.  This ensures they can only see a
-	 * single "bad" read and would, for instance, only loop
-	 * once.
-	 */
-	synchronize_rcu();
-}
-
-/*
- * Find an automatic demotion target for 'node'.
- * Failing here is OK.  It might just indicate
- * being at the end of a chain.
- */
-static int establish_migrate_target(int node, nodemask_t *used,
-				    int best_distance)
-{
-	int migration_target, index, val;
-	struct demotion_nodes *nd;
-
-	if (!node_demotion)
-		return NUMA_NO_NODE;
-
-	nd = &node_demotion[node];
-
-	migration_target = find_next_best_node(node, used);
-	if (migration_target == NUMA_NO_NODE)
-		return NUMA_NO_NODE;
-
-	/*
-	 * If the node has been set a migration target node before,
-	 * which means it's the best distance between them. Still
-	 * check if this node can be demoted to other target nodes
-	 * if they have a same best distance.
-	 */
-	if (best_distance != -1) {
-		val = node_distance(node, migration_target);
-		if (val > best_distance)
-			goto out_clear;
-	}
-
-	index = nd->nr;
-	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
-		      "Exceeds maximum demotion target nodes\n"))
-		goto out_clear;
-
-	nd->nodes[index] = migration_target;
-	nd->nr++;
-
-	return migration_target;
-out_clear:
-	node_clear(migration_target, *used);
-	return NUMA_NO_NODE;
-}
-
-/*
- * When memory fills up on a node, memory contents can be
- * automatically migrated to another node instead of
- * discarded at reclaim.
- *
- * Establish a "migration path" which will start at nodes
- * with CPUs and will follow the priorities used to build the
- * page allocator zonelists.
- *
- * The difference here is that cycles must be avoided.  If
- * node0 migrates to node1, then neither node1, nor anything
- * node1 migrates to can migrate to node0. Also one node can
- * be migrated to multiple nodes if the target nodes all have
- * a same best-distance against the source node.
- *
- * This function can run simultaneously with readers of
- * node_demotion[].  However, it can not run simultaneously
- * with itself.  Exclusion is provided by memory hotplug events
- * being single-threaded.
- */
-static void __set_migration_target_nodes(void)
-{
-	nodemask_t next_pass;
-	nodemask_t this_pass;
-	nodemask_t used_targets = NODE_MASK_NONE;
-	int node, best_distance;
-
-	/*
-	 * Avoid any oddities like cycles that could occur
-	 * from changes in the topology.  This will leave
-	 * a momentary gap when migration is disabled.
-	 */
-	disable_all_migrate_targets();
-
-	/*
-	 * Allocations go close to CPUs, first.  Assume that
-	 * the migration path starts at the nodes with CPUs.
-	 */
-	next_pass = node_states[N_CPU];
-again:
-	this_pass = next_pass;
-	next_pass = NODE_MASK_NONE;
-	/*
-	 * To avoid cycles in the migration "graph", ensure
-	 * that migration sources are not future targets by
-	 * setting them in 'used_targets'.  Do this only
-	 * once per pass so that multiple source nodes can
-	 * share a target node.
-	 *
-	 * 'used_targets' will become unavailable in future
-	 * passes.  This limits some opportunities for
-	 * multiple source nodes to share a destination.
-	 */
-	nodes_or(used_targets, used_targets, this_pass);
-
-	for_each_node_mask(node, this_pass) {
-		best_distance = -1;
-
-		/*
-		 * Try to set up the migration path for the node, and the target
-		 * migration nodes can be multiple, so doing a loop to find all
-		 * the target nodes if they all have a best node distance.
-		 */
-		do {
-			int target_node =
-				establish_migrate_target(node, &used_targets,
-							 best_distance);
-
-			if (target_node == NUMA_NO_NODE)
-				break;
-
-			if (best_distance == -1)
-				best_distance = node_distance(node, target_node);
-
-			/*
-			 * Visit targets from this pass in the next pass.
-			 * Eventually, every node will have been part of
-			 * a pass, and will become set in 'used_targets'.
-			 */
-			node_set(target_node, next_pass);
-		} while (1);
-	}
-	/*
-	 * 'next_pass' contains nodes which became migration
-	 * targets in this pass.  Make additional passes until
-	 * no more migrations targets are available.
-	 */
-	if (!nodes_empty(next_pass))
-		goto again;
-}
-
-/*
- * For callers that do not hold get_online_mems() already.
- */
-void set_migration_target_nodes(void)
-{
-	get_online_mems();
-	__set_migration_target_nodes();
-	put_online_mems();
-}
-
-/*
- * This leaves migrate-on-reclaim transiently disabled between
- * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
- * whether reclaim-based migration is enabled or not, which
- * ensures that the user can turn reclaim-based migration at
- * any time without needing to recalculate migration targets.
- *
- * These callbacks already hold get_online_mems().  That is why
- * __set_migration_target_nodes() can be used as opposed to
- * set_migration_target_nodes().
- */
-#ifdef CONFIG_MEMORY_HOTPLUG
-static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
-						 unsigned long action, void *_arg)
-{
-	struct memory_notify *arg = _arg;
-
-	/*
-	 * Only update the node migration order when a node is
-	 * changing status, like online->offline.  This avoids
-	 * the overhead of synchronize_rcu() in most cases.
-	 */
-	if (arg->status_change_nid < 0)
-		return notifier_from_errno(0);
-
-	switch (action) {
-	case MEM_GOING_OFFLINE:
-		/*
-		 * Make sure there are not transient states where
-		 * an offline node is a migration target.  This
-		 * will leave migration disabled until the offline
-		 * completes and the MEM_OFFLINE case below runs.
-		 */
-		disable_all_migrate_targets();
-		break;
-	case MEM_OFFLINE:
-	case MEM_ONLINE:
-		/*
-		 * Recalculate the target nodes once the node
-		 * reaches its final state (online or offline).
-		 */
-		__set_migration_target_nodes();
-		break;
-	case MEM_CANCEL_OFFLINE:
-		/*
-		 * MEM_GOING_OFFLINE disabled all the migration
-		 * targets.  Reenable them.
-		 */
-		__set_migration_target_nodes();
-		break;
-	case MEM_GOING_ONLINE:
-	case MEM_CANCEL_ONLINE:
-		break;
-	}
-
-	return notifier_from_errno(0);
-}
-#endif
-
-void __init migrate_on_reclaim_init(void)
-{
-	node_demotion = kcalloc(nr_node_ids,
-				sizeof(struct demotion_nodes),
-				GFP_KERNEL);
-	WARN_ON(!node_demotion);
-#ifdef CONFIG_MEMORY_HOTPLUG
-	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
-#endif
-	/*
-	 * At this point, all numa nodes with memory/CPus have their state
-	 * properly set, so we can build the demotion order now.
-	 * Let us hold the cpu_hotplug lock just, as we could possibily have
-	 * CPU hotplug events during boot.
-	 */
-	cpus_read_lock();
-	set_migration_target_nodes();
-	cpus_read_unlock();
-}
 #endif /* CONFIG_NUMA */
-
-
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 373d2730fcf2..35c6ff97cf29 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -28,7 +28,6 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
-#include <linux/migrate.h>
 
 #include "internal.h"
 
@@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu)
 
 	if (!node_state(cpu_to_node(cpu), N_CPU)) {
 		node_set_state(cpu_to_node(cpu), N_CPU);
-		set_migration_target_nodes();
 	}
 
 	return 0;
@@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu)
 		return 0;
 
 	node_clear_state(node, N_CPU);
-	set_migration_target_nodes();
 
 	return 0;
 }
@@ -2118,7 +2115,6 @@ void __init init_mm_internals(void)
 
 	start_shepherd_timer();
 #endif
-	migrate_on_reclaim_init();
 #ifdef CONFIG_PROC_FS
 	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
 	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
-- 
2.36.1



* [PATCH v9 6/8] mm/demotion: Add pg_data_t member to track node memory tier details
  2022-07-14  4:53 [PATCH v9 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2022-07-14  4:53 ` [PATCH v9 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
@ 2022-07-14  4:53 ` Aneesh Kumar K.V
  2022-07-15  5:49   ` Alistair Popple
  2022-07-14  4:53 ` [PATCH v9 7/8] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
  2022-07-14  4:53 ` [PATCH v9 8/8] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
  7 siblings, 1 reply; 29+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-14  4:53 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

Also update the various helpers to use NODE_DATA()->memtier. Since the
memtier of a node can change when the NUMA node is reassigned to a
different memory tier, accessing NODE_DATA()->memtier must happen
under an RCU read lock or memory_tier_lock.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/mmzone.h |  3 ++
 mm/memory-tiers.c      | 64 +++++++++++++++++++++++++++++++-----------
 2 files changed, 50 insertions(+), 17 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aab70355d64f..353812495a70 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -928,6 +928,9 @@ typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
+#ifdef CONFIG_NUMA
+	struct memory_tier __rcu *memtier;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index e951f54ce56c..bab4700bf58d 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -7,6 +7,7 @@
 #include <linux/moduleparam.h>
 #include <linux/memory.h>
 #include <linux/random.h>
+#include <linux/rcupdate.h>
 #include <linux/memory-tiers.h>
 
 #include "internal.h"
@@ -124,18 +125,23 @@ static struct memory_tier *register_memory_tier(unsigned int tier)
 static void unregister_memory_tier(struct memory_tier *memtier)
 {
 	list_del(&memtier->list);
-	kfree(memtier);
+	kfree_rcu(memtier);
 }
 
 static struct memory_tier *__node_get_memory_tier(int node)
 {
-	struct memory_tier *memtier;
+	pg_data_t *pgdat;
 
-	list_for_each_entry(memtier, &memory_tiers, list) {
-		if (node_isset(node, memtier->nodelist))
-			return memtier;
-	}
-	return NULL;
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return NULL;
+	/*
+	 * Since we hold memory_tier_lock, we can avoid
+	 * RCU read locks when accessing the details. No
+	 * parallel updates are possible here.
+	 */
+	return rcu_dereference_check(pgdat->memtier,
+				     lockdep_is_held(&memory_tier_lock));
 }
 
 static struct memory_tier *__get_memory_tier_from_id(int id)
@@ -149,6 +155,33 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
 	return NULL;
 }
 
+/*
+ * Called with memory_tier_lock. Hence the device references cannot
+ * be dropped during this function.
+ */
+static void memtier_node_set(int node, struct memory_tier *memtier)
+{
+	pg_data_t *pgdat;
+	struct memory_tier *current_memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return;
+	/*
+	 * Make sure we mark the memtier NULL before we assign the new memory tier
+	 * to the NUMA node. This makes sure that anybody looking at NODE_DATA
+	 * finds a NULL memtier or the one which is still valid.
+	 */
+	current_memtier = rcu_dereference_check(pgdat->memtier,
+						lockdep_is_held(&memory_tier_lock));
+	rcu_assign_pointer(pgdat->memtier, NULL);
+	if (current_memtier)
+		node_clear(node, current_memtier->nodelist);
+	synchronize_rcu();
+	node_set(node, memtier->nodelist);
+	rcu_assign_pointer(pgdat->memtier, memtier);
+}
+
 static int __node_create_and_set_memory_tier(int node, int tier)
 {
 	int ret = 0;
@@ -162,7 +195,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
 			goto out;
 		}
 	}
-	node_set(node, memtier->nodelist);
+	memtier_node_set(node, memtier);
 out:
 	return ret;
 }
@@ -184,14 +217,7 @@ int node_create_and_set_memory_tier(int node, int tier)
 	if (current_tier->id == tier)
 		goto out;
 
-	node_clear(node, current_tier->nodelist);
-
 	ret = __node_create_and_set_memory_tier(node, tier);
-	if (ret) {
-		/* reset it back to older tier */
-		node_set(node, current_tier->nodelist);
-		goto out;
-	}
 	if (nodes_empty(current_tier->nodelist))
 		unregister_memory_tier(current_tier);
 
@@ -213,7 +239,7 @@ static int __node_set_memory_tier(int node, int tier)
 		ret = -EINVAL;
 		goto out;
 	}
-	node_set(node, memtier->nodelist);
+	memtier_node_set(node, memtier);
 out:
 	return ret;
 }
@@ -428,6 +454,7 @@ static void __init migrate_on_reclaim_init(void)
 
 static int __init memory_tier_init(void)
 {
+	int node;
 	struct memory_tier *memtier;
 
 	/*
@@ -444,7 +471,10 @@ static int __init memory_tier_init(void)
 		      __func__, PTR_ERR(memtier));
 
 	/* CPU only nodes are not part of memory tiers. */
-	memtier->nodelist = node_states[N_MEMORY];
+	for_each_node_state(node, N_MEMORY) {
+		rcu_assign_pointer(NODE_DATA(node)->memtier, memtier);
+		node_set(node, memtier->nodelist);
+	}
 	mutex_unlock(&memory_tier_lock);
 
 	migrate_on_reclaim_init();
-- 
2.36.1



* [PATCH v9 7/8] mm/demotion: Demote pages according to allocation fallback order
  2022-07-14  4:53 [PATCH v9 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2022-07-14  4:53 ` [PATCH v9 6/8] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
@ 2022-07-14  4:53 ` Aneesh Kumar K.V
  2022-07-14  4:53 ` [PATCH v9 8/8] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
  7 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-14  4:53 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya,
	Aneesh Kumar K . V

From: Jagdish Gediya <jvgediya.oss@gmail.com>

Currently, a higher tier node can only be demoted to selected
nodes on the next lower tier as defined by the demotion path.
This strict, hard-coded demotion order does not work in all
use cases (e.g. some use cases may want to allow cross-socket
demotion to another node in the same demotion tier as a fallback
when the preferred demotion node is out of space). This demotion
order is also inconsistent with the page allocation fallback order
when all the nodes in a higher tier are out of space: The page
allocation can fall back to any node from any lower tier, whereas
the demotion order doesn't allow that currently.

This patch adds support for getting all the allowed demotion targets
for a memory tier. The demote_page_list() function is modified
to use this allowed node mask as the fallback allocation mask.
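
As a rough userspace sketch of what "all the allowed demotion targets"
means, the allowed mask of a tier can be computed as the union of the node
lists of every tier below it. The function build_lower_tier_masks() and
the NR_TIERS constant are hypothetical, tier 0 is taken to be the topmost,
and nodemasks are modelled as plain bitmasks; the kernel code in this patch
walks the memory tier list instead.

```c
#include <assert.h>
#include <stddef.h>

#define NR_TIERS 3      /* hypothetical number of tiers */

/*
 * nodelist[i] is a bitmask of the nodes in tier i (tier 0 = topmost).
 * On return, lower[i] holds every node that belongs to a tier below i.
 */
void build_lower_tier_masks(const unsigned int nodelist[NR_TIERS],
                            unsigned int lower[NR_TIERS])
{
    unsigned int below = 0;

    assert(nodelist != NULL && lower != NULL);

    /* Walk from the lowest tier upwards, accumulating node lists. */
    for (int i = NR_TIERS - 1; i >= 0; i--) {
        lower[i] = below;
        below |= nodelist[i];
    }
}
```

For a system with an HBM node 1 in the top tier, a DRAM node 0 in the
middle tier and a PMEM node 2 in the lowest tier, the HBM node may fall
back to nodes 0 and 2, the DRAM node to node 2, and the PMEM node has no
allowed target.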

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h | 15 ++++++-
 mm/memory-tiers.c            | 77 +++++++++++++++++++++++++++++++++---
 mm/vmscan.c                  | 58 ++++++++++++++++++++-------
 3 files changed, 129 insertions(+), 21 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 3234301c2537..8137e9322603 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -3,6 +3,8 @@
 #define _LINUX_MEMORY_TIERS_H
 
 #include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/mmzone.h>
 
 #ifdef CONFIG_NUMA
 
@@ -17,12 +19,18 @@ extern bool numa_demotion_enabled;
 int node_create_and_set_memory_tier(int node, int tier);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
 #else
 static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
-#endif
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
+#endif /* CONFIG_MIGRATION */
 
 #else
 
@@ -36,5 +44,10 @@ static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index bab4700bf58d..98e55600f250 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -16,6 +16,7 @@ struct memory_tier {
 	struct list_head list;
 	int id;
 	nodemask_t nodelist;
+	nodemask_t lower_tier_mask;
 };
 
 struct demotion_nodes {
@@ -268,6 +269,24 @@ static int node_set_memory_tier(int node, int tier)
 }
 
 #ifdef CONFIG_MIGRATION
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * pg_data_t.memtier updates include a synchronize_rcu()
+	 * which ensures that we either find NULL or a valid memtier
+	 * in NODE_DATA. Protect the access via rcu_read_lock().
+	 */
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (memtier)
+		*targets = memtier->lower_tier_mask;
+	else
+		*targets = NODE_MASK_NONE;
+	rcu_read_unlock();
+}
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
@@ -316,10 +335,19 @@ int next_demotion_node(int node)
 /* Disable reclaim-based migration. */
 static void __disable_all_migrate_targets(void)
 {
+	struct memory_tier *memtier;
 	int node;
 
-	for_each_node_state(node, N_MEMORY)
+	for_each_node_state(node, N_MEMORY) {
 		node_demotion[node].preferred = NODE_MASK_NONE;
+		/*
+		 * We are holding memory_tier_lock, so it is safe
+		 * to access pgdat->memtier.
+		 */
+		memtier = rcu_dereference_check(NODE_DATA(node)->memtier,
+						lockdep_is_held(&memory_tier_lock));
+		memtier->lower_tier_mask = NODE_MASK_NONE;
+	}
 }
 
 static void disable_all_migrate_targets(void)
@@ -349,10 +377,26 @@ static void establish_migration_targets(void)
 	struct demotion_nodes *nd;
 	int target = NUMA_NO_NODE, node;
 	int distance, best_distance;
-	nodemask_t used;
-
-	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
-		return;
+	nodemask_t used, lower_tier = NODE_MASK_NONE;
+
+	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION)) {
+
+		for_each_node_state(node, N_MEMORY) {
+			/*
+			 * We are holding memory_tier_lock, so it is safe
+			 * to access pgdat->memtier.
+			 */
+			memtier = rcu_dereference_check(NODE_DATA(node)->memtier,
+							lockdep_is_held(&memory_tier_lock));
+			memtier->lower_tier_mask = NODE_MASK_NONE;
+		}
+		/*
+		 * Wait for read side to work with old values
+		 * or see the updated NODE_MASK_NONE;
+		 */
+		synchronize_rcu();
+		goto build_lower_tier_mask;
+	}
 
 	disable_all_migrate_targets();
 
@@ -395,6 +439,29 @@ static void establish_migration_targets(void)
 			}
 		} while (1);
 	}
+build_lower_tier_mask:
+	/*
+	 * Now build the lower_tier mask for each node collecting node mask from
+	 * all memory tier below it. This allows us to fallback demotion page
+	 * allocation to a set of nodes that is closer to the above selected
+	 * preferred node.
+	 */
+	list_for_each_entry(memtier, &memory_tiers, list)
+		nodes_or(lower_tier, lower_tier, memtier->nodelist);
+	/*
+	 * Removes nodes not yet in N_MEMORY.
+	 */
+	nodes_and(lower_tier, node_states[N_MEMORY], lower_tier);
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		/*
+		 * Keep removing the current tier from the lower_tier nodes.
+		 * This removes all nodes in the current and higher
+		 * memory tiers from the lower_tier mask.
+		 */
+		nodes_andnot(lower_tier, lower_tier, memtier->nodelist);
+		memtier->lower_tier_mask = lower_tier;
+	}
 }
 
 static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a8f78277f99..60a5235dd639 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1460,21 +1460,34 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-static struct page *alloc_demote_page(struct page *page, unsigned long node)
+static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
-	struct migration_target_control mtc = {
-		/*
-		 * Allocate from 'node', or fail quickly and quietly.
-		 * When this happens, 'page' will likely just be discarded
-		 * instead of migrated.
-		 */
-		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
-			    __GFP_THISNODE  | __GFP_NOWARN |
-			    __GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = node
-	};
+	struct page *target_page;
+	nodemask_t *allowed_mask;
+	struct migration_target_control *mtc;
+
+	mtc = (struct migration_target_control *)private;
+
+	allowed_mask = mtc->nmask;
+	/*
+	 * make sure we allocate from the target node first also trying to
+	 * reclaim pages from the target node via kswapd if we are low on
+	 * free memory on target node. If we don't do this and if we have low
+	 * free memory on the target memtier, we would start allocating pages
+	 * from higher memory tiers without even forcing a demotion of cold
+	 * pages from the target memtier. This can result in the kernel placing
+	 * hotpages in higher memory tiers.
+	 */
+	mtc->nmask = NULL;
+	mtc->gfp_mask |= __GFP_THISNODE;
+	target_page = alloc_migration_target(page, (unsigned long)mtc);
+	if (target_page)
+		return target_page;
 
-	return alloc_migration_target(page, (unsigned long)&mtc);
+	mtc->gfp_mask &= ~__GFP_THISNODE;
+	mtc->nmask = allowed_mask;
+
+	return alloc_migration_target(page, (unsigned long)mtc);
 }
 
 /*
@@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
+	nodemask_t allowed_mask;
+
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
+			__GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};
 
 	if (list_empty(demote_pages))
 		return 0;
@@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	if (target_nid == NUMA_NO_NODE)
 		return 0;
 
+	node_get_allowed_targets(pgdat, &allowed_mask);
+
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_pages, alloc_demote_page, NULL,
-			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
-			    &nr_succeeded);
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+		      &nr_succeeded);
 
 	if (current_is_kswapd())
 		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
-- 
2.36.1



* [PATCH v9 8/8] mm/demotion: Update node_is_toptier to work with memory tiers
  2022-07-14  4:53 [PATCH v9 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                   ` (6 preceding siblings ...)
  2022-07-14  4:53 ` [PATCH v9 7/8] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
@ 2022-07-14  4:53 ` Aneesh Kumar K.V
  7 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-14  4:53 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Aneesh Kumar K.V

With memory tiers support, we can have memory-only NUMA nodes in the
top tier, and we want to avoid promotion-tracking NUMA faults on them.
Update node_is_toptier() to work with memory tiers.
All NUMA nodes are top tier nodes by default. Once lower memory tiers
are added, a memory tier that has CPU NUMA nodes, and every memory
tier above it, is considered a top memory tier.
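
As a rough userspace sketch of this rule (not the kernel code: struct tier,
the bitmask node sets and both helper names are hypothetical stand-ins for
struct memory_tier, nodemask_t and the logic in establish_migration_targets()
and node_is_toptier()), assuming the tiers are ordered from highest to lowest
and that a larger tier ID means a higher tier:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct memory_tier. */
struct tier {
	int id;              /* larger id == higher tier (assumption) */
	unsigned long nodes; /* bit n set: node n belongs to this tier */
};

/*
 * Walk tiers[] from the lowest tier upwards and return the id of the
 * first tier that contains a CPU node. tiers[] is assumed to be sorted
 * from highest to lowest. If no tier has CPUs, the previous value is
 * kept, which is passed in as @prev.
 */
static int find_top_tier_id(const struct tier *tiers, size_t ntiers,
			    unsigned long cpu_nodes, int prev)
{
	assert(tiers != NULL || ntiers == 0);

	for (size_t i = ntiers; i-- > 0; ) {
		if (tiers[i].nodes & cpu_nodes)
			return tiers[i].id;
	}
	return prev;
}

/* A tier is "top" when its id is at or above the detected top tier id. */
static bool tier_is_toptier(int tier_id, int top_tier_id)
{
	return tier_id >= top_tier_id;
}
```

With an HBM tier (ID 300), a CPU + DRAM tier (ID 200) and a PMEM tier
(ID 100), the sketch picks 200 as the top tier ID, so the HBM and DRAM tiers
count as top tier and the PMEM tier does not.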

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  6 ++++++
 include/linux/node.h         |  5 -----
 mm/huge_memory.c             |  1 +
 mm/memory-tiers.c            | 41 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 |  1 +
 mm/mprotect.c                |  1 +
 6 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 8137e9322603..0e5591ddc74b 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -31,6 +31,7 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
 	*targets = NODE_MASK_NONE;
 }
 #endif /* CONFIG_MIGRATION */
+bool node_is_toptier(int node);
 
 #else
 
@@ -49,5 +50,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+	return true;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/node.h b/include/linux/node.h
index 40d641a8bfb0..9ec680dd607f 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
 
 #define to_node(device) container_of(device, struct node, dev)
 
-static inline bool node_is_toptier(int node)
-{
-	return node_state(node, N_CPU);
-}
-
 #endif /* _LINUX_NODE_H_ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 834f288b3769..8405662646e9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -35,6 +35,7 @@
 #include <linux/numa.h>
 #include <linux/page_owner.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 98e55600f250..923b056e222c 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -26,6 +26,7 @@ struct demotion_nodes {
 static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+static int top_tier_id;
 /*
  * node_demotion[] examples:
  *
@@ -268,6 +269,31 @@ static int node_set_memory_tier(int node, int tier)
 	return ret;
 }
 
+bool node_is_toptier(int node)
+{
+	bool toptier;
+	pg_data_t *pgdat;
+	struct memory_tier *memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return false;
+
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (!memtier) {
+		toptier = true;
+		goto out;
+	}
+	if (memtier->id >= top_tier_id)
+		toptier = true;
+	else
+		toptier = false;
+out:
+	rcu_read_unlock();
+	return toptier;
+}
+
 #ifdef CONFIG_MIGRATION
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 {
@@ -440,6 +466,21 @@ static void establish_migration_targets(void)
 		} while (1);
 	}
 build_lower_tier_mask:
+	/*
+	 * Promotion is allowed from a memory tier to higher
+	 * memory tier only if the memory tier doesn't include
+	 * compute. We want to skip promotion from a memory tier
+	 * if any node that is part of the memory tier has CPUs.
+	 * Once we detect such a memory tier, we consider that tier
+	 * as the top tier, from which promotion is not allowed.
+	 */
+	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
+		nodes_and(used, node_states[N_CPU], memtier->nodelist);
+		if (!nodes_empty(used)) {
+			top_tier_id = memtier->id;
+			break;
+		}
+	}
 	/*
 	 * Now build the lower_tier mask for each node collecting node mask from
 	 * all memory tier below it. This allows us to fallback demotion page
diff --git a/mm/migrate.c b/mm/migrate.c
index c758c9c21d7d..1da81136eaaa 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@
 #include <linux/memory.h>
 #include <linux/random.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..92a2fc0fa88b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -31,6 +31,7 @@
 #include <linux/pgtable.h>
 #include <linux/sched/sysctl.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/memory-tiers.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
-- 
2.36.1



* Re: [PATCH v9 4/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined
  2022-07-14  4:53 ` [PATCH v9 4/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined Aneesh Kumar K.V
@ 2022-07-15  4:38   ` Alistair Popple
  2022-07-15  7:23     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 29+ messages in thread
From: Alistair Popple @ 2022-07-15  4:38 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Dan Williams,
	Johannes Weiner, jvgediya.oss


"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> If the new NUMA node onlined doesn't have a memory tier assigned,
> the kernel adds the NUMA node to the default memory tier.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  mm/memory-tiers.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
>
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 79347d4ab05e..5706ad647136 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -5,6 +5,7 @@
>  #include <linux/slab.h>
>  #include <linux/lockdep.h>
>  #include <linux/moduleparam.h>
> +#include <linux/memory.h>
>  #include <linux/memory-tiers.h>
>
>  struct memory_tier {
> @@ -130,8 +131,73 @@ int node_create_and_set_memory_tier(int node, int tier)
>  }
>  EXPORT_SYMBOL_GPL(node_create_and_set_memory_tier);
>
> +static int __node_set_memory_tier(int node, int tier)
> +{
> +	int ret = 0;
> +	struct memory_tier *memtier;
> +
> +	memtier = __get_memory_tier_from_id(tier);
> +	if (!memtier) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	node_set(node, memtier->nodelist);
> +out:
> +	return ret;
> +}
> +
> +static int node_set_memory_tier(int node, int tier)

Minor comment, but I don't like the name of this function as it doesn't
always set the node to the given tier.

Something like this would make it clearer the tier value is only used if
the node isn't already assigned to a tier:

static int init_node_memory_tier(int node, int default_tier)
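
To illustrate the semantics I have in mind, here is a minimal userspace
sketch. It is only an assumption-laden model: struct node_tiers, MAX_NODES
and NO_TIER are made up for the example, and unlike the kernel function it
returns the effective tier instead of an errno.

```c
#include <assert.h>

#define MAX_NODES 8   /* hypothetical limit for the sketch */
#define NO_TIER  (-1) /* hypothetical "not assigned yet" marker */

/* Hypothetical per-node tier table, standing in for the memtier lookup. */
struct node_tiers {
	int tier[MAX_NODES];
};

static void node_tiers_reset(struct node_tiers *nt)
{
	for (int i = 0; i < MAX_NODES; i++)
		nt->tier[i] = NO_TIER;
}

/*
 * Assign @default_tier only if @node has no tier yet. An existing
 * assignment always wins. Returns the tier the node ends up in.
 */
static int init_node_memory_tier(struct node_tiers *nt, int node,
				 int default_tier)
{
	assert(node >= 0 && node < MAX_NODES);

	if (nt->tier[node] == NO_TIER)
		nt->tier[node] = default_tier;
	return nt->tier[node];
}
```

Calling it twice on the same node with different tiers leaves the node in
the tier from the first call, which is what the "init" in the name is meant
to convey.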

> +{
> +	struct memory_tier *memtier;
> +	int ret = 0;
> +
> +	mutex_lock(&memory_tier_lock);
> +	memtier = __node_get_memory_tier(node);
> +	if (!memtier)
> +		ret = __node_set_memory_tier(node, tier);
> +
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return ret;
> +}
> +
>  static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
>  core_param(default_memory_tier, default_memtier, uint, 0644);
> +/*
> + * This runs whether reclaim-based migration is enabled or not,
> + * which ensures that the user can turn reclaim-based migration
> + * on at any time without needing to recalculate migration targets.
> + */
> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> +						 unsigned long action, void *_arg)
> +{
> +	struct memory_notify *arg = _arg;
> +
> +	/*
> +	 * Only update the node migration order when a node is
> +	 * changing status, like online->offline.
> +	 */
> +	if (arg->status_change_nid < 0)
> +		return notifier_from_errno(0);
> +
> +	switch (action) {
> +	case MEM_ONLINE:
> +		/*
> +		 * We ignore the error here, if the node already have the tier
> +		 * registered, we will continue to use that for the new memory
> +		 * we are adding here.
> +		 */
> +		node_set_memory_tier(arg->status_change_nid, default_memtier);
> +		break;
> +	}
> +
> +	return notifier_from_errno(0);
> +}
> +
> +static void __init migrate_on_reclaim_init(void)
> +{
> +	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
> +}
>
>  static int __init memory_tier_init(void)
>  {
> @@ -153,6 +219,8 @@ static int __init memory_tier_init(void)
>  	/* CPU only nodes are not part of memory tiers. */
>  	memtier->nodelist = node_states[N_MEMORY];
>  	mutex_unlock(&memory_tier_lock);
> +
> +	migrate_on_reclaim_init();
>  	return 0;
>  }
>  subsys_initcall(memory_tier_init);


* Re: [PATCH v9 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-14  4:53 ` [PATCH v9 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
@ 2022-07-15  4:47   ` Alistair Popple
  2022-07-15  7:21     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 29+ messages in thread
From: Alistair Popple @ 2022-07-15  4:47 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Dan Williams,
	Johannes Weiner, jvgediya.oss


"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> This patch switches the demotion target building logic to use memory tiers
> instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the
> default tier 200 and additional memory tiers will be added by drivers like
> dax kmem.
>
> This patch builds the demotion target for a NUMA node by looking at all
> memory tiers below the tier to which the NUMA node belongs. The closest node
> in the immediately following memory tier is used as a demotion target.
>
> Since we are now only building demotion target for N_MEMORY NUMA nodes
> the CPU hotplug calls are removed in this patch.
>
> A new memory tier can be inserted into the tier hierarchy for a new set
> of nodes without affecting the node assignment of any existing memtier,
> provided that there is enough gap in the tier ID values for the new memtier.
>
> The absolute value of tier ID of a memtier doesn't necessarily carry any meaning.
> Its value relative to other memtiers decides the level of this memtier in the tier
> hierarchy.
>
> For now, this patch supports hardcoded tier ID values which are 300, 200 and 100 for
> memory tiers.
>
> Suggested-by: Wei Xu <weixugc@google.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h |  13 ++
>  include/linux/migrate.h      |  13 --
>  mm/memory-tiers.c            | 227 ++++++++++++++++++++
>  mm/migrate.c                 | 394 -----------------------------------
>  mm/vmstat.c                  |   4 -
>  5 files changed, 240 insertions(+), 411 deletions(-)
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 9d36ff13c954..3234301c2537 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -15,6 +15,14 @@
>
>  extern bool numa_demotion_enabled;
>  int node_create_and_set_memory_tier(int node, int tier);
> +#ifdef CONFIG_MIGRATION
> +int next_demotion_node(int node);
> +#else
> +static inline int next_demotion_node(int node)
> +{
> +	return NUMA_NO_NODE;
> +}
> +#endif
>
>  #else
>
> @@ -23,5 +31,10 @@ static inline int node_create_and_set_memory_tier(int node, int tier)
>  {
>  	return 0;
>  }
> +
> +static inline int next_demotion_node(int node)
> +{
> +	return NUMA_NO_NODE;
> +}
>  #endif	/* CONFIG_NUMA */
>  #endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 43e737215f33..93fab62e6548 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
>
>  #endif /* CONFIG_MIGRATION */
>
> -#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
> -extern void set_migration_target_nodes(void);
> -extern void migrate_on_reclaim_init(void);
> -extern int next_demotion_node(int node);
> -#else
> -static inline void set_migration_target_nodes(void) {}
> -static inline void migrate_on_reclaim_init(void) {}
> -static inline int next_demotion_node(int node)
> -{
> -        return NUMA_NO_NODE;
> -}
> -#endif
> -
>  #ifdef CONFIG_COMPACTION
>  extern int PageMovable(struct page *page);
>  extern void __SetPageMovable(struct page *page, struct address_space *mapping);
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 5706ad647136..e951f54ce56c 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -6,16 +6,85 @@
>  #include <linux/lockdep.h>
>  #include <linux/moduleparam.h>
>  #include <linux/memory.h>
> +#include <linux/random.h>
>  #include <linux/memory-tiers.h>
>
> +#include "internal.h"
> +
>  struct memory_tier {
>  	struct list_head list;
>  	int id;
>  	nodemask_t nodelist;
>  };
>
> +struct demotion_nodes {
> +	nodemask_t preferred;
> +};
> +
> +static void establish_migration_targets(void);
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
> +/*
> + * node_demotion[] examples:
> + *
> + * Example 1:
> + *
> + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
> + *
> + * node distances:
> + * node   0    1    2    3
> + *    0  10   20   30   40
> + *    1  20   10   40   30
> + *    2  30   40   10   40
> + *    3  40   30   40   10
> + *
> + * memory_tiers[0] = <empty>
> + * memory_tiers[1] = 0-1
> + * memory_tiers[2] = 2-3
> + *
> + * node_demotion[0].preferred = 2
> + * node_demotion[1].preferred = 3
> + * node_demotion[2].preferred = <empty>
> + * node_demotion[3].preferred = <empty>
> + *
> + * Example 2:
> + *
> + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
> + *
> + * node distances:
> + * node   0    1    2
> + *    0  10   20   30
> + *    1  20   10   30
> + *    2  30   30   10
> + *
> + * memory_tiers[0] = <empty>
> + * memory_tiers[1] = 0-2
> + * memory_tiers[2] = <empty>
> + *
> + * node_demotion[0].preferred = <empty>
> + * node_demotion[1].preferred = <empty>
> + * node_demotion[2].preferred = <empty>
> + *
> + * Example 3:
> + *
> + * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
> + *
> + * node distances:
> + * node   0    1    2
> + *    0  10   20   30
> + *    1  20   10   40
> + *    2  30   40   10
> + *
> + * memory_tiers[0] = 1
> + * memory_tiers[1] = 0
> + * memory_tiers[2] = 2
> + *
> + * node_demotion[0].preferred = 2
> + * node_demotion[1].preferred = 0
> + * node_demotion[2].preferred = <empty>
> + *
> + */
> +static struct demotion_nodes *node_demotion __read_mostly;
>
>  static void insert_memory_tier(struct memory_tier *memtier)
>  {
> @@ -108,6 +177,7 @@ int node_create_and_set_memory_tier(int node, int tier)
>  	current_tier = __node_get_memory_tier(node);
>  	if (!current_tier) {
>  		ret = __node_create_and_set_memory_tier(node, tier);
> +		establish_migration_targets();
>  		goto out;
>  	}
>
> @@ -124,6 +194,8 @@ int node_create_and_set_memory_tier(int node, int tier)
>  	}
>  	if (nodes_empty(current_tier->nodelist))
>  		unregister_memory_tier(current_tier);
> +
> +	establish_migration_targets();
>  out:
>  	mutex_unlock(&memory_tier_lock);
>
> @@ -153,14 +225,152 @@ static int node_set_memory_tier(int node, int tier)
>
>  	mutex_lock(&memory_tier_lock);
>  	memtier = __node_get_memory_tier(node);
> +	/*
> +	 * If the node is already part of a tier, proceed with the
> +	 * current tier value, because we might want to establish
> +	 * new migration paths now. The node might be added to a tier
> +	 * before it was made part of N_MEMORY, hence establish_migration_targets()
> +	 * will have skipped this node.
> +	 */
>  	if (!memtier)
>  		ret = __node_set_memory_tier(node, tier);
> +	establish_migration_targets();
>
>  	mutex_unlock(&memory_tier_lock);
>
>  	return ret;
>  }
>
> +#ifdef CONFIG_MIGRATION
> +/**
> + * next_demotion_node() - Get the next node in the demotion path
> + * @node: The starting node to lookup the next node
> + *
> + * Return: node id for next memory node in the demotion path hierarchy
> + * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
> + * @node online or guarantee that it *continues* to be the next demotion
> + * target.
> + */
> +int next_demotion_node(int node)
> +{
> +	struct demotion_nodes *nd;
> +	int target;
> +
> +	if (!node_demotion)
> +		return NUMA_NO_NODE;
> +
> +	nd = &node_demotion[node];
> +
> +	/*
> +	 * node_demotion[] is updated without excluding this
> +	 * function from running.
> +	 *
> +	 * Make sure to use RCU over entire code blocks if
> +	 * node_demotion[] reads need to be consistent.
> +	 */
> +	rcu_read_lock();
> +	/*
> +	 * If there are multiple target nodes, just select one
> +	 * target node randomly.
> +	 *
> +	 * In addition, we can also use round-robin to select
> +	 * target node, but we should introduce another variable
> +	 * for node_demotion[] to record last selected target node,
> +	 * that may cause cache ping-pong due to the changing of
> +	 * last target node. Or introducing per-cpu data to avoid
> +	 * caching issue, which seems more complicated. So selecting
> +	 * target node randomly seems better until now.
> +	 */
> +	target = node_random(&nd->preferred);
> +	rcu_read_unlock();
> +
> +	return target;
> +}
> +
> +/* Disable reclaim-based migration. */
> +static void __disable_all_migrate_targets(void)
> +{
> +	int node;
> +
> +	for_each_node_state(node, N_MEMORY)
> +		node_demotion[node].preferred = NODE_MASK_NONE;
> +}
> +
> +static void disable_all_migrate_targets(void)
> +{
> +	__disable_all_migrate_targets();
> +
> +	/*
> +	 * Ensure that the "disable" is visible across the system.
> +	 * Readers will see either a combination of before+disable
> +	 * state or disable+after.  They will never see before and
> +	 * after state together.
> +	 */
> +	synchronize_rcu();
> +}
> +#else
> +static void disable_all_migrate_targets(void) {}
> +#endif
> +
> +/*
> + * Find an automatic demotion target for all memory
> + * nodes. Failing here is OK.  It might just indicate
> + * being at the end of a chain.
> + */
> +static void establish_migration_targets(void)
> +{
> +	struct memory_tier *memtier;
> +	struct demotion_nodes *nd;
> +	int target = NUMA_NO_NODE, node;
> +	int distance, best_distance;
> +	nodemask_t used;
> +
> +	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))

Does it make sense to include the memory tiering/demotion code if
CONFIG_MIGRATION isn't enabled? From what I can tell none of the
information established here is used if CONFIG_MIGRATION isn't enabled,
so it would be better to remove the IS_ENABLED checks and not include
the code at all.
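
Roughly the usual stub pattern, sketched here in userspace form. This is
only an illustration: DEMO_CONFIG_MIGRATION and demo_targets[] are made-up
names standing in for CONFIG_MIGRATION and node_demotion[], and the real
code would live in a file that is only built when migration is enabled.

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)

/* Hypothetical toggle standing in for CONFIG_MIGRATION. */
#ifdef DEMO_CONFIG_MIGRATION

/* Hypothetical demotion table: nodes 0/1 demote to 2/3. */
static int demo_targets[4] = { 2, 3, NUMA_NO_NODE, NUMA_NO_NODE };

static inline void establish_migration_targets(void)
{
	/* The real target building would go here. */
}

static inline int next_demotion_node(int node)
{
	assert(node >= 0 && node < 4);
	return demo_targets[node];
}

#else /* !DEMO_CONFIG_MIGRATION */

/* Empty stubs: callers need no IS_ENABLED() checks of their own. */
static inline void establish_migration_targets(void) {}

static inline int next_demotion_node(int node)
{
	(void)node;
	return NUMA_NO_NODE;
}

#endif /* DEMO_CONFIG_MIGRATION */
```

With the toggle undefined, callers compile unchanged and every node simply
reports NUMA_NO_NODE as its demotion target.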

> +		return;
> +
> +	disable_all_migrate_targets();
> +
> +	for_each_node_state(node, N_MEMORY) {
> +		best_distance = -1;
> +		nd = &node_demotion[node];
> +
> +		memtier = __node_get_memory_tier(node);
> +		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
> +			continue;
> +		/*
> +		 * Get the next memtier to find the demotion node list.
> +		 */
> +		memtier = list_next_entry(memtier, list);
> +
> +		/*
> +		 * find_next_best_node() uses the 'used' nodemask as a skip list.
> +		 * Add all memory nodes except the selected memory tier
> +		 * nodelist to skip list so that we find the best node from the
> +		 * memtier nodelist.
> +		 */
> +		nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);
> +
> +		/*
> +		 * Find all the nodes in the memory tier node list with the same best
> +		 * distance and add them to the preferred mask. We randomly select between nodes
> +		 * in the preferred mask when allocating pages during demotion.
> +		 */
> +		do {
> +			target = find_next_best_node(node, &used);
> +			if (target == NUMA_NO_NODE)
> +				break;
> +
> +			distance = node_distance(node, target);
> +			if (distance == best_distance || best_distance == -1) {
> +				best_distance = distance;
> +				node_set(target, nd->preferred);
> +			} else {
> +				break;
> +			}
> +		} while (1);
> +	}
> +}
> +
>  static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
>  core_param(default_memory_tier, default_memtier, uint, 0644);
>  /*
> @@ -181,6 +391,17 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>  		return notifier_from_errno(0);
>
>  	switch (action) {
> +	case MEM_OFFLINE:
> +		/*
> +		 * In case we are moving out of N_MEMORY, keep the node
> +		 * in the memory tier so that when we bring memory online,
> +		 * they appear in the right memory tier. We still need
> +		 * to rebuild the demotion order.
> +		 */
> +		mutex_lock(&memory_tier_lock);
> +		establish_migration_targets();
> +		mutex_unlock(&memory_tier_lock);
> +		break;
>  	case MEM_ONLINE:
>  		/*
>  		 * We ignore the error here, if the node already have the tier
> @@ -196,6 +417,12 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>
>  static void __init migrate_on_reclaim_init(void)
>  {
> +
> +	if (IS_ENABLED(CONFIG_MIGRATION)) {
> +		node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
> +					GFP_KERNEL);
> +		WARN_ON(!node_demotion);
> +	}
>  	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
>  }
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index fce7d4a9e940..c758c9c21d7d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2117,398 +2117,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>  	return 0;
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
> -
> -/*
> - * node_demotion[] example:
> - *
> - * Consider a system with two sockets.  Each socket has
> - * three classes of memory attached: fast, medium and slow.
> - * Each memory class is placed in its own NUMA node.  The
> - * CPUs are placed in the node with the "fast" memory.  The
> - * 6 NUMA nodes (0-5) might be split among the sockets like
> - * this:
> - *
> - *	Socket A: 0, 1, 2
> - *	Socket B: 3, 4, 5
> - *
> - * When Node 0 fills up, its memory should be migrated to
> - * Node 1.  When Node 1 fills up, it should be migrated to
> - * Node 2.  The migration path start on the nodes with the
> - * processors (since allocations default to this node) and
> - * fast memory, progress through medium and end with the
> - * slow memory:
> - *
> - *	0 -> 1 -> 2 -> stop
> - *	3 -> 4 -> 5 -> stop
> - *
> - * This is represented in the node_demotion[] like this:
> - *
> - *	{  nr=1, nodes[0]=1 }, // Node 0 migrates to 1
> - *	{  nr=1, nodes[0]=2 }, // Node 1 migrates to 2
> - *	{  nr=0, nodes[0]=-1 }, // Node 2 does not migrate
> - *	{  nr=1, nodes[0]=4 }, // Node 3 migrates to 4
> - *	{  nr=1, nodes[0]=5 }, // Node 4 migrates to 5
> - *	{  nr=0, nodes[0]=-1 }, // Node 5 does not migrate
> - *
> - * Moreover some systems may have multiple slow memory nodes.
> - * Suppose a system has one socket with 3 memory nodes, node 0
> - * is fast memory type, and node 1/2 both are slow memory
> - * type, and the distance between fast memory node and slow
> - * memory node is same. So the migration path should be:
> - *
> - *	0 -> 1/2 -> stop
> - *
> - * This is represented in the node_demotion[] like this:
> - *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
> - *	{ nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
> - *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
> - */
> -
> -/*
> - * Writes to this array occur without locking.  Cycles are
> - * not allowed: Node X demotes to Y which demotes to X...
> - *
> - * If multiple reads are performed, a single rcu_read_lock()
> - * must be held over all reads to ensure that no cycles are
> - * observed.
> - */
> -#define DEFAULT_DEMOTION_TARGET_NODES 15
> -
> -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
> -#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
> -#else
> -#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
> -#endif
> -
> -struct demotion_nodes {
> -	unsigned short nr;
> -	short nodes[DEMOTION_TARGET_NODES];
> -};
> -
> -static struct demotion_nodes *node_demotion __read_mostly;
> -
> -/**
> - * next_demotion_node() - Get the next node in the demotion path
> - * @node: The starting node to lookup the next node
> - *
> - * Return: node id for next memory node in the demotion path hierarchy
> - * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
> - * @node online or guarantee that it *continues* to be the next demotion
> - * target.
> - */
> -int next_demotion_node(int node)
> -{
> -	struct demotion_nodes *nd;
> -	unsigned short target_nr, index;
> -	int target;
> -
> -	if (!node_demotion)
> -		return NUMA_NO_NODE;
> -
> -	nd = &node_demotion[node];
> -
> -	/*
> -	 * node_demotion[] is updated without excluding this
> -	 * function from running.  RCU doesn't provide any
> -	 * compiler barriers, so the READ_ONCE() is required
> -	 * to avoid compiler reordering or read merging.
> -	 *
> -	 * Make sure to use RCU over entire code blocks if
> -	 * node_demotion[] reads need to be consistent.
> -	 */
> -	rcu_read_lock();
> -	target_nr = READ_ONCE(nd->nr);
> -
> -	switch (target_nr) {
> -	case 0:
> -		target = NUMA_NO_NODE;
> -		goto out;
> -	case 1:
> -		index = 0;
> -		break;
> -	default:
> -		/*
> -		 * If there are multiple target nodes, just select one
> -		 * target node randomly.
> -		 *
> -		 * In addition, we can also use round-robin to select
> -		 * target node, but we should introduce another variable
> -		 * for node_demotion[] to record last selected target node,
> -		 * that may cause cache ping-pong due to the changing of
> -		 * last target node. Or introducing per-cpu data to avoid
> -		 * caching issue, which seems more complicated. So selecting
> -		 * target node randomly seems better until now.
> -		 */
> -		index = get_random_int() % target_nr;
> -		break;
> -	}
> -
> -	target = READ_ONCE(nd->nodes[index]);
> -
> -out:
> -	rcu_read_unlock();
> -	return target;
> -}
> -
> -/* Disable reclaim-based migration. */
> -static void __disable_all_migrate_targets(void)
> -{
> -	int node, i;
> -
> -	if (!node_demotion)
> -		return;
> -
> -	for_each_online_node(node) {
> -		node_demotion[node].nr = 0;
> -		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
> -			node_demotion[node].nodes[i] = NUMA_NO_NODE;
> -	}
> -}
> -
> -static void disable_all_migrate_targets(void)
> -{
> -	__disable_all_migrate_targets();
> -
> -	/*
> -	 * Ensure that the "disable" is visible across the system.
> -	 * Readers will see either a combination of before+disable
> -	 * state or disable+after.  They will never see before and
> -	 * after state together.
> -	 *
> -	 * The before+after state together might have cycles and
> -	 * could cause readers to do things like loop until this
> -	 * function finishes.  This ensures they can only see a
> -	 * single "bad" read and would, for instance, only loop
> -	 * once.
> -	 */
> -	synchronize_rcu();
> -}
> -
> -/*
> - * Find an automatic demotion target for 'node'.
> - * Failing here is OK.  It might just indicate
> - * being at the end of a chain.
> - */
> -static int establish_migrate_target(int node, nodemask_t *used,
> -				    int best_distance)
> -{
> -	int migration_target, index, val;
> -	struct demotion_nodes *nd;
> -
> -	if (!node_demotion)
> -		return NUMA_NO_NODE;
> -
> -	nd = &node_demotion[node];
> -
> -	migration_target = find_next_best_node(node, used);
> -	if (migration_target == NUMA_NO_NODE)
> -		return NUMA_NO_NODE;
> -
> -	/*
> -	 * If the node has been set a migration target node before,
> -	 * which means it's the best distance between them. Still
> -	 * check if this node can be demoted to other target nodes
> -	 * if they have a same best distance.
> -	 */
> -	if (best_distance != -1) {
> -		val = node_distance(node, migration_target);
> -		if (val > best_distance)
> -			goto out_clear;
> -	}
> -
> -	index = nd->nr;
> -	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
> -		      "Exceeds maximum demotion target nodes\n"))
> -		goto out_clear;
> -
> -	nd->nodes[index] = migration_target;
> -	nd->nr++;
> -
> -	return migration_target;
> -out_clear:
> -	node_clear(migration_target, *used);
> -	return NUMA_NO_NODE;
> -}
> -
> -/*
> - * When memory fills up on a node, memory contents can be
> - * automatically migrated to another node instead of
> - * discarded at reclaim.
> - *
> - * Establish a "migration path" which will start at nodes
> - * with CPUs and will follow the priorities used to build the
> - * page allocator zonelists.
> - *
> - * The difference here is that cycles must be avoided.  If
> - * node0 migrates to node1, then neither node1, nor anything
> - * node1 migrates to can migrate to node0. Also one node can
> - * be migrated to multiple nodes if the target nodes all have
> - * a same best-distance against the source node.
> - *
> - * This function can run simultaneously with readers of
> - * node_demotion[].  However, it can not run simultaneously
> - * with itself.  Exclusion is provided by memory hotplug events
> - * being single-threaded.
> - */
> -static void __set_migration_target_nodes(void)
> -{
> -	nodemask_t next_pass;
> -	nodemask_t this_pass;
> -	nodemask_t used_targets = NODE_MASK_NONE;
> -	int node, best_distance;
> -
> -	/*
> -	 * Avoid any oddities like cycles that could occur
> -	 * from changes in the topology.  This will leave
> -	 * a momentary gap when migration is disabled.
> -	 */
> -	disable_all_migrate_targets();
> -
> -	/*
> -	 * Allocations go close to CPUs, first.  Assume that
> -	 * the migration path starts at the nodes with CPUs.
> -	 */
> -	next_pass = node_states[N_CPU];
> -again:
> -	this_pass = next_pass;
> -	next_pass = NODE_MASK_NONE;
> -	/*
> -	 * To avoid cycles in the migration "graph", ensure
> -	 * that migration sources are not future targets by
> -	 * setting them in 'used_targets'.  Do this only
> -	 * once per pass so that multiple source nodes can
> -	 * share a target node.
> -	 *
> -	 * 'used_targets' will become unavailable in future
> -	 * passes.  This limits some opportunities for
> -	 * multiple source nodes to share a destination.
> -	 */
> -	nodes_or(used_targets, used_targets, this_pass);
> -
> -	for_each_node_mask(node, this_pass) {
> -		best_distance = -1;
> -
> -		/*
> -		 * Try to set up the migration path for the node, and the target
> -		 * migration nodes can be multiple, so doing a loop to find all
> -		 * the target nodes if they all have a best node distance.
> -		 */
> -		do {
> -			int target_node =
> -				establish_migrate_target(node, &used_targets,
> -							 best_distance);
> -
> -			if (target_node == NUMA_NO_NODE)
> -				break;
> -
> -			if (best_distance == -1)
> -				best_distance = node_distance(node, target_node);
> -
> -			/*
> -			 * Visit targets from this pass in the next pass.
> -			 * Eventually, every node will have been part of
> -			 * a pass, and will become set in 'used_targets'.
> -			 */
> -			node_set(target_node, next_pass);
> -		} while (1);
> -	}
> -	/*
> -	 * 'next_pass' contains nodes which became migration
> -	 * targets in this pass.  Make additional passes until
> -	 * no more migrations targets are available.
> -	 */
> -	if (!nodes_empty(next_pass))
> -		goto again;
> -}
> -
> -/*
> - * For callers that do not hold get_online_mems() already.
> - */
> -void set_migration_target_nodes(void)
> -{
> -	get_online_mems();
> -	__set_migration_target_nodes();
> -	put_online_mems();
> -}
> -
> -/*
> - * This leaves migrate-on-reclaim transiently disabled between
> - * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
> - * whether reclaim-based migration is enabled or not, which
> - * ensures that the user can turn reclaim-based migration at
> - * any time without needing to recalculate migration targets.
> - *
> - * These callbacks already hold get_online_mems().  That is why
> - * __set_migration_target_nodes() can be used as opposed to
> - * set_migration_target_nodes().
> - */
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> -						 unsigned long action, void *_arg)
> -{
> -	struct memory_notify *arg = _arg;
> -
> -	/*
> -	 * Only update the node migration order when a node is
> -	 * changing status, like online->offline.  This avoids
> -	 * the overhead of synchronize_rcu() in most cases.
> -	 */
> -	if (arg->status_change_nid < 0)
> -		return notifier_from_errno(0);
> -
> -	switch (action) {
> -	case MEM_GOING_OFFLINE:
> -		/*
> -		 * Make sure there are not transient states where
> -		 * an offline node is a migration target.  This
> -		 * will leave migration disabled until the offline
> -		 * completes and the MEM_OFFLINE case below runs.
> -		 */
> -		disable_all_migrate_targets();
> -		break;
> -	case MEM_OFFLINE:
> -	case MEM_ONLINE:
> -		/*
> -		 * Recalculate the target nodes once the node
> -		 * reaches its final state (online or offline).
> -		 */
> -		__set_migration_target_nodes();
> -		break;
> -	case MEM_CANCEL_OFFLINE:
> -		/*
> -		 * MEM_GOING_OFFLINE disabled all the migration
> -		 * targets.  Reenable them.
> -		 */
> -		__set_migration_target_nodes();
> -		break;
> -	case MEM_GOING_ONLINE:
> -	case MEM_CANCEL_ONLINE:
> -		break;
> -	}
> -
> -	return notifier_from_errno(0);
> -}
> -#endif
> -
> -void __init migrate_on_reclaim_init(void)
> -{
> -	node_demotion = kcalloc(nr_node_ids,
> -				sizeof(struct demotion_nodes),
> -				GFP_KERNEL);
> -	WARN_ON(!node_demotion);
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
> -#endif
> -	/*
> -	 * At this point, all numa nodes with memory/CPus have their state
> -	 * properly set, so we can build the demotion order now.
> -	 * Let us hold the cpu_hotplug lock just, as we could possibily have
> -	 * CPU hotplug events during boot.
> -	 */
> -	cpus_read_lock();
> -	set_migration_target_nodes();
> -	cpus_read_unlock();
> -}
>  #endif /* CONFIG_NUMA */
> -
> -
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 373d2730fcf2..35c6ff97cf29 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -28,7 +28,6 @@
>  #include <linux/mm_inline.h>
>  #include <linux/page_ext.h>
>  #include <linux/page_owner.h>
> -#include <linux/migrate.h>
>
>  #include "internal.h"
>
> @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu)
>
>  	if (!node_state(cpu_to_node(cpu), N_CPU)) {
>  		node_set_state(cpu_to_node(cpu), N_CPU);
> -		set_migration_target_nodes();
>  	}
>
>  	return 0;
> @@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu)
>  		return 0;
>
>  	node_clear_state(node, N_CPU);
> -	set_migration_target_nodes();
>
>  	return 0;
>  }
> @@ -2118,7 +2115,6 @@ void __init init_mm_internals(void)
>
>  	start_shepherd_timer();
>  #endif
> -	migrate_on_reclaim_init();
>  #ifdef CONFIG_PROC_FS
>  	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
>  	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);


* Re: [PATCH v9 6/8] mm/demotion: Add pg_data_t member to track node memory tier details
  2022-07-14  4:53 ` [PATCH v9 6/8] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
@ 2022-07-15  5:49   ` Alistair Popple
  2022-07-15  7:19     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 29+ messages in thread
From: Alistair Popple @ 2022-07-15  5:49 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Dan Williams,
	Johannes Weiner, jvgediya.oss


"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> Also update different helpers to use NODE_DATA()->memtier. Since the
> node-specific memtier can change when a NUMA node is reassigned to a
> different memory tier, accessing NODE_DATA()->memtier needs to happen
> under an RCU read lock or memory_tier_lock.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/mmzone.h |  3 ++
>  mm/memory-tiers.c      | 64 +++++++++++++++++++++++++++++++-----------
>  2 files changed, 50 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index aab70355d64f..353812495a70 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -928,6 +928,9 @@ typedef struct pglist_data {
>  	/* Per-node vmstats */
>  	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
>  	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
> +#ifdef CONFIG_NUMA
> +	struct memory_tier __rcu *memtier;
> +#endif
>  } pg_data_t;
>
>  #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index e951f54ce56c..bab4700bf58d 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -7,6 +7,7 @@
>  #include <linux/moduleparam.h>
>  #include <linux/memory.h>
>  #include <linux/random.h>
> +#include <linux/rcupdate.h>
>  #include <linux/memory-tiers.h>
>
>  #include "internal.h"
> @@ -124,18 +125,23 @@ static struct memory_tier *register_memory_tier(unsigned int tier)
>  static void unregister_memory_tier(struct memory_tier *memtier)
>  {
>  	list_del(&memtier->list);
> -	kfree(memtier);
> +	kfree_rcu(memtier);
>  }
>
>  static struct memory_tier *__node_get_memory_tier(int node)
>  {
> -	struct memory_tier *memtier;
> +	pg_data_t *pgdat;
>
> -	list_for_each_entry(memtier, &memory_tiers, list) {
> -		if (node_isset(node, memtier->nodelist))
> -			return memtier;
> -	}
> -	return NULL;
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return NULL;
> +	/*
> +	 * Since we hold memory_tier_lock, we can avoid
> +	 * RCU read locks when accessing the details. No
> +	 * parallel updates are possible here.
> +	 */
> +	return rcu_dereference_check(pgdat->memtier,
> +				     lockdep_is_held(&memory_tier_lock));
>  }
>
>  static struct memory_tier *__get_memory_tier_from_id(int id)
> @@ -149,6 +155,33 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
>  	return NULL;
>  }
>
> +/*
> + * Called with memory_tier_lock. Hence the device references cannot
> + * be dropped during this function.
> + */
> +static void memtier_node_set(int node, struct memory_tier *memtier)
> +{
> +	pg_data_t *pgdat;
> +	struct memory_tier *current_memtier;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
> +	/*
> +	 * Make sure we mark the memtier NULL before we assign the new memory tier
> +	 * to the NUMA node. This make sure that anybody looking at NODE_DATA
> +	 * finds a NULL memtier or the one which is still valid.
> +	 */
> +	current_memtier = rcu_dereference_check(pgdat->memtier,
> +						lockdep_is_held(&memory_tier_lock));
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	if (current_memtier)
> +		node_clear(node, current_memtier->nodelist);

It seems odd to me that you would update the current memtier prior to
the synchronize_rcu(). I suppose it's really memory_tier_lock that
protects the details like ->nodelist, but is there any reason not to do
the update after it anyway?

> +	synchronize_rcu();
> +	node_set(node, memtier->nodelist);
> +	rcu_assign_pointer(pgdat->memtier, memtier);
> +}
> +
>  static int __node_create_and_set_memory_tier(int node, int tier)
>  {
>  	int ret = 0;
> @@ -162,7 +195,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
>  			goto out;
>  		}
>  	}
> -	node_set(node, memtier->nodelist);
> +	memtier_node_set(node, memtier);
>  out:
>  	return ret;
>  }
> @@ -184,14 +217,7 @@ int node_create_and_set_memory_tier(int node, int tier)
>  	if (current_tier->id == tier)
>  		goto out;
>
> -	node_clear(node, current_tier->nodelist);
> -
>  	ret = __node_create_and_set_memory_tier(node, tier);
> -	if (ret) {
> -		/* reset it back to older tier */
> -		node_set(node, current_tier->nodelist);
> -		goto out;
> -	}
>  	if (nodes_empty(current_tier->nodelist))
>  		unregister_memory_tier(current_tier);
>
> @@ -213,7 +239,7 @@ static int __node_set_memory_tier(int node, int tier)
>  		ret = -EINVAL;
>  		goto out;
>  	}
> -	node_set(node, memtier->nodelist);
> +	memtier_node_set(node, memtier);
>  out:
>  	return ret;
>  }
> @@ -428,6 +454,7 @@ static void __init migrate_on_reclaim_init(void)
>
>  static int __init memory_tier_init(void)
>  {
> +	int node;
>  	struct memory_tier *memtier;
>
>  	/*
> @@ -444,7 +471,10 @@ static int __init memory_tier_init(void)
>  		      __func__, PTR_ERR(memtier));
>
>  	/* CPU only nodes are not part of memory tiers. */
> -	memtier->nodelist = node_states[N_MEMORY];
> +	for_each_node_state(node, N_MEMORY) {
> +		rcu_assign_pointer(NODE_DATA(node)->memtier, memtier);
> +		node_set(node, memtier->nodelist);

Similar comment here - the order seems opposite to what I'd expect.
Shouldn't memtier->nodelist be fully initialised prior to making it
visible with rcu_assign_pointer()?

> +	}
>  	mutex_unlock(&memory_tier_lock);
>
>  	migrate_on_reclaim_init();


* Re: [PATCH v9 6/8] mm/demotion: Add pg_data_t member to track node memory tier details
  2022-07-15  5:49   ` Alistair Popple
@ 2022-07-15  7:19     ` Aneesh Kumar K.V
  2022-07-18  5:22       ` Alistair Popple
  0 siblings, 1 reply; 29+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-15  7:19 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Dan Williams,
	Johannes Weiner, jvgediya.oss

Alistair Popple <apopple@nvidia.com> writes:

> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
>> Also update different helpers to use NODE_DATA()->memtier. Since the
>> node-specific memtier can change when a NUMA node is reassigned to a
>> different memory tier, accessing NODE_DATA()->memtier needs to happen
>> under an RCU read lock or memory_tier_lock.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  include/linux/mmzone.h |  3 ++
>>  mm/memory-tiers.c      | 64 +++++++++++++++++++++++++++++++-----------
>>  2 files changed, 50 insertions(+), 17 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index aab70355d64f..353812495a70 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -928,6 +928,9 @@ typedef struct pglist_data {
>>  	/* Per-node vmstats */
>>  	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
>>  	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
>> +#ifdef CONFIG_NUMA
>> +	struct memory_tier __rcu *memtier;
>> +#endif
>>  } pg_data_t;
>>
>>  #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> index e951f54ce56c..bab4700bf58d 100644
>> --- a/mm/memory-tiers.c
>> +++ b/mm/memory-tiers.c
>> @@ -7,6 +7,7 @@
>>  #include <linux/moduleparam.h>
>>  #include <linux/memory.h>
>>  #include <linux/random.h>
>> +#include <linux/rcupdate.h>
>>  #include <linux/memory-tiers.h>
>>
>>  #include "internal.h"
>> @@ -124,18 +125,23 @@ static struct memory_tier *register_memory_tier(unsigned int tier)
>>  static void unregister_memory_tier(struct memory_tier *memtier)
>>  {
>>  	list_del(&memtier->list);
>> -	kfree(memtier);
>> +	kfree_rcu(memtier);
>>  }
>>
>>  static struct memory_tier *__node_get_memory_tier(int node)
>>  {
>> -	struct memory_tier *memtier;
>> +	pg_data_t *pgdat;
>>
>> -	list_for_each_entry(memtier, &memory_tiers, list) {
>> -		if (node_isset(node, memtier->nodelist))
>> -			return memtier;
>> -	}
>> -	return NULL;
>> +	pgdat = NODE_DATA(node);
>> +	if (!pgdat)
>> +		return NULL;
>> +	/*
>> +	 * Since we hold memory_tier_lock, we can avoid
>> +	 * RCU read locks when accessing the details. No
>> +	 * parallel updates are possible here.
>> +	 */
>> +	return rcu_dereference_check(pgdat->memtier,
>> +				     lockdep_is_held(&memory_tier_lock));
>>  }
>>
>>  static struct memory_tier *__get_memory_tier_from_id(int id)
>> @@ -149,6 +155,33 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
>>  	return NULL;
>>  }
>>
>> +/*
>> + * Called with memory_tier_lock. Hence the device references cannot
>> + * be dropped during this function.
>> + */
>> +static void memtier_node_set(int node, struct memory_tier *memtier)
>> +{
>> +	pg_data_t *pgdat;
>> +	struct memory_tier *current_memtier;
>> +
>> +	pgdat = NODE_DATA(node);
>> +	if (!pgdat)
>> +		return;
>> +	/*
>> +	 * Make sure we mark the memtier NULL before we assign the new memory tier
>> +	 * to the NUMA node. This make sure that anybody looking at NODE_DATA
>> +	 * finds a NULL memtier or the one which is still valid.
>> +	 */
>> +	current_memtier = rcu_dereference_check(pgdat->memtier,
>> +						lockdep_is_held(&memory_tier_lock));
>> +	rcu_assign_pointer(pgdat->memtier, NULL);
>> +	if (current_memtier)
>> +		node_clear(node, current_memtier->nodelist);
>
> It seems odd to me that you would update the current memtier prior to
> the synchronize_rcu(). I suppose it's really memory_tier_lock that
> protects the details like ->nodelist, but is there any reason not to do
> the update after it anyway?

The synchronize_rcu() ensures that a lockless read of pgdat->memtier
sees either NULL or a stable memtier that has the current NUMA node in
its nodelist. IIUC, you are suggesting that we move the node_clear()
after the synchronize_rcu()? I am also wondering whether I need an
smp_wmb()?

pgdat->memtier = NULL;
synchronize_rcu();
remove node from old memtier;
set node in new memtier;
smp_wmb();
pgdat->memtier = new memtier;
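
A minimal user-space sketch of the sequence above, using C11 atomics as a
stand-in for the kernel primitives. This is not kernel code: struct tier,
node_tier[], publish_tier() and node_tier_is_consistent() are hypothetical
names, and the grace-period wait is only marked by a comment. As far as I
know, rcu_assign_pointer() already uses a release store when publishing a
non-NULL pointer, which is what the final release store here models, so a
separate write barrier before it may not be needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct memory_tier. */
struct tier {
	int id;
	unsigned long nodelist;	/* one bit per node, stand-in for nodemask_t */
};

/* Stand-in for pgdat->memtier: one published pointer per node. */
static _Atomic(struct tier *) node_tier[8];

/* Writer side; the caller is assumed to hold the equivalent of memory_tier_lock. */
static void publish_tier(int node, struct tier *new_tier)
{
	struct tier *old = atomic_load_explicit(&node_tier[node],
						memory_order_relaxed);

	/* Step 1: unpublish, so that new readers see NULL. */
	atomic_store_explicit(&node_tier[node], (struct tier *)NULL,
			      memory_order_release);

	/* Step 2: the kernel would call synchronize_rcu() here (omitted). */

	/* Step 3: update both nodelists while this node's pointer is NULL. */
	if (old)
		old->nodelist &= ~(1UL << node);
	new_tier->nodelist |= 1UL << node;

	/* Step 4: the release store orders the nodelist writes before it. */
	atomic_store_explicit(&node_tier[node], new_tier,
			      memory_order_release);
}

/* Reader side: true if the node has a tier and that tier lists the node. */
static bool node_tier_is_consistent(int node)
{
	struct tier *t = atomic_load_explicit(&node_tier[node],
					      memory_order_acquire);

	return t && (t->nodelist & (1UL << node));
}
```

With this ordering a reader either finds NULL or a tier whose nodelist
already contains the node, which is the property the patch comment asks for.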


>
>> +	synchronize_rcu();
>> +	node_set(node, memtier->nodelist);
>> +	rcu_assign_pointer(pgdat->memtier, memtier);
>> +}
>> +
>>  static int __node_create_and_set_memory_tier(int node, int tier)
>>  {
>>  	int ret = 0;
>> @@ -162,7 +195,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
>>  			goto out;
>>  		}
>>  	}
>> -	node_set(node, memtier->nodelist);
>> +	memtier_node_set(node, memtier);
>>  out:
>>  	return ret;
>>  }
>> @@ -184,14 +217,7 @@ int node_create_and_set_memory_tier(int node, int tier)
>>  	if (current_tier->id == tier)
>>  		goto out;
>>
>> -	node_clear(node, current_tier->nodelist);
>> -
>>  	ret = __node_create_and_set_memory_tier(node, tier);
>> -	if (ret) {
>> -		/* reset it back to older tier */
>> -		node_set(node, current_tier->nodelist);
>> -		goto out;
>> -	}
>>  	if (nodes_empty(current_tier->nodelist))
>>  		unregister_memory_tier(current_tier);
>>
>> @@ -213,7 +239,7 @@ static int __node_set_memory_tier(int node, int tier)
>>  		ret = -EINVAL;
>>  		goto out;
>>  	}
>> -	node_set(node, memtier->nodelist);
>> +	memtier_node_set(node, memtier);
>>  out:
>>  	return ret;
>>  }
>> @@ -428,6 +454,7 @@ static void __init migrate_on_reclaim_init(void)
>>
>>  static int __init memory_tier_init(void)
>>  {
>> +	int node;
>>  	struct memory_tier *memtier;
>>
>>  	/*
>> @@ -444,7 +471,10 @@ static int __init memory_tier_init(void)
>>  		      __func__, PTR_ERR(memtier));
>>
>>  	/* CPU only nodes are not part of memory tiers. */
>> -	memtier->nodelist = node_states[N_MEMORY];
>> +	for_each_node_state(node, N_MEMORY) {
>> +		rcu_assign_pointer(NODE_DATA(node)->memtier, memtier);
>> +		node_set(node, memtier->nodelist);
>
> Similar comment here - the order seems opposite to what I'd expect.
> Shouldn't memtier->nodelist be fully initialised prior to making it
> visible with rcu_assign_pointer()?

Will fix this. This runs early during boot, so the ordering won't
affect correctness. Hence I can skip the smp_wmb()?

>
>> +	}
>>  	mutex_unlock(&memory_tier_lock);
>>
>>  	migrate_on_reclaim_init();


* Re: [PATCH v9 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-15  4:47   ` Alistair Popple
@ 2022-07-15  7:21     ` Aneesh Kumar K.V
  2022-07-18  5:41       ` Alistair Popple
  0 siblings, 1 reply; 29+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-15  7:21 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Dan Williams,
	Johannes Weiner, jvgediya.oss

Alistair Popple <apopple@nvidia.com> writes:
....

> + */
>> +static void establish_migration_targets(void)
>> +{
>> +	struct memory_tier *memtier;
>> +	struct demotion_nodes *nd;
>> +	int target = NUMA_NO_NODE, node;
>> +	int distance, best_distance;
>> +	nodemask_t used;
>> +
>> +	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
>
> Does it make sense to include the memory tiering/demotion code if
> CONFIG_MIGRATION isn't enabled? From what I can tell none of the
> information established here is used if CONFIG_MIGRATION isn't enabled,
> so it would be better to remove the IS_ENABLED checks and not include
> the code at all.

We use the same function/code path for updating the top_tier details.
Wouldn't we want node_is_toptier() to work even with CONFIG_MIGRATION
disabled?

>
>> +		return;
>> +
>> +	disable_all_migrate_targets();
>> +
>> +	for_each_node_state(node, N_MEMORY) {
>> +		best_distance = -1;
>> +		nd = &node_demotion[node];
>> +
>> +		memtier = __node_get_memory_tier(node);
>> +		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
>> +			continue;
>> +		/*
>> +		 * Get the next memtier to find the  demotion node list.
>> +		 */
>> +		memtier = list_next_entry(memtier, list);
>> +
>> +		/*
>> +		 * find_next_best_node, use 'used' nodemask as a skip list.
>> +		 * Add all memory nodes except the selected memory tier
>> +		 * nodelist to skip list so that we find the best node from the
>> +		 * memtier nodelist.
>> +		 */
>> +		nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);
>> +
>> +		/*
>> +		 * Find all the nodes in the memory tier node list of same best distance.
>> +		 * add them to the preferred mask. We randomly select between nodes
>> +		 * in the preferred mask when allocating pages during demotion.
>> +		 */
>> +		do {
>> +			target = find_next_best_node(node, &used);
>> +			if (target == NUMA_NO_NODE)
>> +				break;
>> +
>> +			distance = node_distance(node, target);
>> +			if (distance == best_distance || best_distance == -1) {
>> +				best_distance = distance;
>> +				node_set(target, nd->preferred);
>> +			} else {
>> +				break;
>> +			}
>> +		} while (1);
>> +	}
>> +}
>> +
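
The selection rule in the quoted loop (collect every node of the next lower
tier that shares the smallest distance from the source node) can be sketched
in user space as below. This is a simplified illustration, not the patch's
code: it scans a plain bitmask instead of calling find_next_best_node(), and
the distance table, NR_NODES and preferred_targets() are made-up names and
values.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4

/* Hypothetical SLIT-style distance table; the values are made up. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

/*
 * Return a bitmask of the nodes in 'candidates' (standing in for the next
 * lower tier's nodelist) that share the smallest distance from 'node'.
 */
static unsigned long preferred_targets(int node, unsigned long candidates)
{
	unsigned long preferred = 0;
	int best = INT_MAX;
	int target;

	for (target = 0; target < NR_NODES; target++) {
		if (!(candidates & (1UL << target)))
			continue;
		if (distance[node][target] < best) {
			/* Strictly closer: restart the preferred set. */
			best = distance[node][target];
			preferred = 1UL << target;
		} else if (distance[node][target] == best) {
			/* Same best distance: add to the preferred set. */
			preferred |= 1UL << target;
		}
	}
	return preferred;
}
```

For node 0 with nodes 2 and 3 in the lower tier, both are at distance 40 and
both end up in the preferred mask, so demotion can pick between them.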

.....

-aneesh


* Re: [PATCH v9 4/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined
  2022-07-15  4:38   ` Alistair Popple
@ 2022-07-15  7:23     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-15  7:23 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Dan Williams,
	Johannes Weiner, jvgediya.oss

Alistair Popple <apopple@nvidia.com> writes:

> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
>> If the new NUMA node onlined doesn't have a memory tier assigned,
>> the kernel adds the NUMA node to default memory tier.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  mm/memory-tiers.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 68 insertions(+)
>>
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> index 79347d4ab05e..5706ad647136 100644
>> --- a/mm/memory-tiers.c
>> +++ b/mm/memory-tiers.c
>> @@ -5,6 +5,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/lockdep.h>
>>  #include <linux/moduleparam.h>
>> +#include <linux/memory.h>
>>  #include <linux/memory-tiers.h>
>>
>>  struct memory_tier {
>> @@ -130,8 +131,73 @@ int node_create_and_set_memory_tier(int node, int tier)
>>  }
>>  EXPORT_SYMBOL_GPL(node_create_and_set_memory_tier);
>>
>> +static int __node_set_memory_tier(int node, int tier)
>> +{
>> +	int ret = 0;
>> +	struct memory_tier *memtier;
>> +
>> +	memtier = __get_memory_tier_from_id(tier);
>> +	if (!memtier) {
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +	node_set(node, memtier->nodelist);
>> +out:
>> +	return ret;
>> +}
>> +
>> +static int node_set_memory_tier(int node, int tier)
>
> Minor comment, but I don't like the name of this function as it doesn't
> always set the node to the given tier.
>
> Something like this would make it clearer the tier value is only used if
> the node isn't already assigned to a tier:
>
> static int init_node_memory_tier(int node, int default_tier)
>

Will rename to init_node_memory_tier()
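
A small user-space sketch of the semantics the new name is meant to convey,
namely that the default tier is applied only when the node has no tier yet.
The node_tier_id[] array, NO_TIER and the _sketch suffix are hypothetical,
and returning the tier in effect is a choice made for this illustration, not
what the patch does.

```c
#include <assert.h>

#define NR_NODES 4
#define NO_TIER  (-1)

/* Hypothetical per-node tier id table; NO_TIER means "not assigned yet". */
static int node_tier_id[NR_NODES] = { NO_TIER, NO_TIER, NO_TIER, NO_TIER };

/*
 * Assign 'default_tier' only if the node has no tier yet.
 * Returns the tier id in effect afterwards, or -1 for an invalid node.
 */
static int init_node_memory_tier_sketch(int node, int default_tier)
{
	if (node < 0 || node >= NR_NODES)
		return -1;
	if (node_tier_id[node] == NO_TIER)
		node_tier_id[node] = default_tier;
	return node_tier_id[node];
}
```

Calling it a second time with a different default leaves the first
assignment untouched, which is why "init" describes it better than "set".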

-aneesh


* Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-14  4:53 ` [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
@ 2022-07-15  7:53   ` Huang, Ying
  2022-07-15  9:08     ` Aneesh Kumar K V
  2022-07-15 16:59     ` Wei Xu
  0 siblings, 2 replies; 29+ messages in thread
From: Huang, Ying @ 2022-07-15  7:53 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created
> during the kernel initialization and updated when a NUMA node is
> hot-added or hot-removed.  The current implementation puts all
> nodes with CPU into the top tier, and builds the tier hierarchy
> tier-by-tier by establishing the per-node demotion targets based
> on the distances between nodes.
>
> This current memory tier kernel interface needs to be improved for
> several important use cases:
>
> The current tier initialization code always initializes
> each memory-only NUMA node into a lower tier.  But a memory-only
> NUMA node may have a high performance memory device (e.g. a DRAM
> device attached via CXL.mem or a DRAM-backed memory-only node on
> a virtual machine) and should be put into a higher tier.
>
> The current tier hierarchy always puts CPU nodes into the top
> tier. But on a system with HBM or GPU devices, the
> memory-only NUMA nodes mapping these devices should be in the
> top tier, and DRAM nodes with CPUs are better placed in the
> next lower tier.
>
> With the current kernel, a higher tier node can only be demoted to
> selected nodes on the next lower tier as defined by the demotion path,
> not to any other node from any lower tier.  This strict, hard-coded
> demotion order does not work in all use cases (e.g. some use cases may
> want to allow cross-socket demotion to another node in the same
> demotion tier as a fallback when the preferred demotion node is out of
> space).  This demotion order is also inconsistent with the page
> allocation fallback order when all the nodes in a higher tier are
> out of space: the page allocation can fall back to any node from
> any lower tier, whereas the demotion order doesn't allow that.
>
> The current kernel also doesn't provide any interfaces for
> userspace to learn about the memory tier hierarchy in order to
> optimize its memory allocations.
>
> This patch series addresses the above by defining memory tiers explicitly.
>
> This patch introduces explicit memory tiers. The tier ID value
> of a memory tier is used to derive the demotion order between
> NUMA nodes.
>
> For example, if we have 3 memtiers: memtier100, memtier200, memtier300
> then the memory tier order is: memtier300 -> memtier200 -> memtier100
> where memtier300 is the highest tier and memtier100 is the lowest tier.
>
> During reclaim, we migrate pages from fast (higher) tiers to slow (lower)
> tiers when the fast (higher) tier is under memory pressure.
>
> This patchset introduces 3 memory tiers (memtier100, memtier200 and memtier300)
> which are created by different kernel subsystems. The default memory
> tier created by the kernel is memtier200. A kernel parameter is provided
> to override the default memory tier.
>
> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h | 15 +++++++
>  mm/Makefile                  |  1 +
>  mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 94 insertions(+)
>  create mode 100644 include/linux/memory-tiers.h
>  create mode 100644 mm/memory-tiers.c
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> new file mode 100644
> index 000000000000..a81dbc20e0d1
> --- /dev/null
> +++ b/include/linux/memory-tiers.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +#ifdef CONFIG_NUMA
> +
> +#define MEMORY_TIER_HBM_GPU	300
> +#define MEMORY_TIER_DRAM	200
> +#define MEMORY_TIER_PMEM	100
> +
> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> +#define MAX_MEMORY_TIER_ID	400
> +
> +#endif	/* CONFIG_NUMA */
> +#endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 6f9ffa968a1a..d30acebc2164 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>  obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> new file mode 100644
> index 000000000000..011877b6dbb9
> --- /dev/null
> +++ b/mm/memory-tiers.c
> @@ -0,0 +1,78 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/types.h>
> +#include <linux/nodemask.h>
> +#include <linux/slab.h>
> +#include <linux/lockdep.h>
> +#include <linux/moduleparam.h>
> +#include <linux/memory-tiers.h>
> +
> +struct memory_tier {
> +	struct list_head list;
> +	int id;
> +	nodemask_t nodelist;
> +};
> +
> +static DEFINE_MUTEX(memory_tier_lock);
> +static LIST_HEAD(memory_tiers);
> +
> +static void insert_memory_tier(struct memory_tier *memtier)
> +{
> +	struct list_head *ent;
> +	struct memory_tier *tmp_memtier;
> +
> +	lockdep_assert_held_once(&memory_tier_lock);
> +
> +	list_for_each(ent, &memory_tiers) {
> +		tmp_memtier = list_entry(ent, struct memory_tier, list);
> +		if (tmp_memtier->id < memtier->id) {
> +			list_add_tail(&memtier->list, ent);
> +			return;
> +		}
> +	}
> +	list_add_tail(&memtier->list, &memory_tiers);
> +}
> +
> +static struct memory_tier *register_memory_tier(unsigned int tier)
> +{
> +	struct memory_tier *memtier;
> +
> +	if (tier > MAX_MEMORY_TIER_ID)
> +		return ERR_PTR(-EINVAL);
> +
> +	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> +	if (!memtier)
> +		return ERR_PTR(-ENOMEM);
> +
> +	memtier->id   = tier;
> +
> +	insert_memory_tier(memtier);
> +
> +	return memtier;
> +}
> +
> +static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
> +core_param(default_memory_tier, default_memtier, uint, 0644);
> +
> +static int __init memory_tier_init(void)
> +{
> +	struct memory_tier *memtier;
> +
> +	/*
> +	 * Register only default memory tier to hide all empty
> +	 * memory tier from sysfs. Since this is early during
> +	 * boot, we could avoid holding memtory_tier_lock. But
> +	 * keep it simple by holding locks. So we can add lock
> +	 * held debug checks in other functions.
> +	 */
> +	mutex_lock(&memory_tier_lock);
> +	memtier = register_memory_tier(default_memtier);
> +	if (IS_ERR(memtier))
> +		panic("%s() failed to register memory tier: %ld\n",
> +		      __func__, PTR_ERR(memtier));
> +
> +	/* CPU only nodes are not part of memory tiers. */
> +	memtier->nodelist = node_states[N_MEMORY];
> +	mutex_unlock(&memory_tier_lock);
> +	return 0;
> +}
> +subsys_initcall(memory_tier_init);

You dropped the original sysfs interface patches from the series, but
the kernel internal implementation is still built for the original sysfs
interface.  For example, the memory tier ID is for the original sysfs
interface, not for the newly proposed sysfs interface.  So I suggest
implementing it with the new interface in mind.  What do you think about
the following design?

- Each NUMA node belongs to a memory type, and each memory type
  corresponds to an "abstract distance", so each NUMA node corresponds to
  a "distance".  For simplicity, we can start with static distances, for
  example, DRAM (default): 150, PMEM: 250.  The distance of each NUMA
  node can be recorded in a global array,

    int node_distances[MAX_NUMNODES];

  or, just

    pgdat->distance

- Each memory tier corresponds to a range of distances, for example,
  0-100, 100-200, 200-300, >300.  We can start with static ranges too.

- The core API of memory tier could be

    struct memory_tier *find_create_memory_tier(int distance);

  it will find the memory tier which covers "distance" in the memory
  tier list, or create a new memory tier if not found.

- The kmem_dax driver will set up the distance for PMEM NUMA nodes before
  onlining them.

- When a NUMA node is onlined, we will use find_create_memory_tier() to
  find or create its memory tier and add the NUMA node into the memory
  tier.

- Or we can add a memory type data structure now.
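
As a rough illustration of the design above, here is a minimal userspace
sketch of find_create_memory_tier().  It assumes the static ranges 0-100,
100-200, 200-300 and >300 are half-open intervals and that tiers are kept
in a list sorted by ascending distance.  The TIER_RANGE and TIER_OPEN_START
constants, the start/end fields, and calloc() standing in for kzalloc() are
assumptions for illustration only, and locking is omitted.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical static layout: [0,100), [100,200), [200,300), [300,INT_MAX). */
#define TIER_RANGE	100
#define TIER_OPEN_START	300

struct memory_tier {
	struct memory_tier *next;	/* list sorted by ascending start */
	int start;			/* inclusive lower bound */
	int end;			/* exclusive upper bound (INT_MAX if open) */
};

static struct memory_tier *memory_tiers;

/* Return the tier covering 'distance', creating it on first use. */
struct memory_tier *find_create_memory_tier(int distance)
{
	struct memory_tier **pos = &memory_tiers;
	struct memory_tier *tier;
	int start;

	assert(distance >= 0);
	if (distance >= TIER_OPEN_START)
		start = TIER_OPEN_START;
	else
		start = (distance / TIER_RANGE) * TIER_RANGE;

	/* Skip tiers that cover smaller distances. */
	while (*pos && (*pos)->start < start)
		pos = &(*pos)->next;
	if (*pos && (*pos)->start == start)
		return *pos;		/* tier already exists */

	tier = calloc(1, sizeof(*tier));
	if (!tier)
		return NULL;
	tier->start = start;
	tier->end = (start == TIER_OPEN_START) ? INT_MAX : start + TIER_RANGE;
	tier->next = *pos;		/* insert before the first larger tier */
	*pos = tier;
	return tier;
}
```

With the example distances, DRAM (150) lands in the 100-200 tier and PMEM
(250) in the 200-300 tier, and a later lookup for any distance in the same
range returns the existing tier instead of creating a new one.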

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-15  7:53   ` Huang, Ying
@ 2022-07-15  9:08     ` Aneesh Kumar K V
  2022-07-15  9:24       ` Aneesh Kumar K V
                         ` (2 more replies)
  2022-07-15 16:59     ` Wei Xu
  1 sibling, 3 replies; 29+ messages in thread
From: Aneesh Kumar K V @ 2022-07-15  9:08 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

On 7/15/22 1:23 PM, Huang, Ying wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> 
>> In the current kernel, memory tiers are defined implicitly via a
>> demotion path relationship between NUMA nodes, which is created
>> during the kernel initialization and updated when a NUMA node is
>> hot-added or hot-removed.  The current implementation puts all
>> nodes with CPU into the top tier, and builds the tier hierarchy
>> tier-by-tier by establishing the per-node demotion targets based
>> on the distances between nodes.
>>
>> This current memory tier kernel interface needs to be improved for
>> several important use cases,
>>
>> The current tier initialization code always initializes
>> each memory-only NUMA node into a lower tier.  But a memory-only
>> NUMA node may have a high performance memory device (e.g. a DRAM
>> device attached via CXL.mem or a DRAM-backed memory-only node on
>> a virtual machine) and should be put into a higher tier.
>>
>> The current tier hierarchy always puts CPU nodes into the top
>> tier. But on a system with HBM or GPU devices, the
>> memory-only NUMA nodes mapping these devices should be in the
>> top tier, and DRAM nodes with CPUs are better to be placed into the
>> next lower tier.
>>
>> With current kernel higher tier node can only be demoted to selected nodes on the
>> next lower tier as defined by the demotion path, not any other
>> node from any lower tier.  This strict, hard-coded demotion order
>> does not work in all use cases (e.g. some use cases may want to
>> allow cross-socket demotion to another node in the same demotion
>> tier as a fallback when the preferred demotion node is out of
>> space), This demotion order is also inconsistent with the page
>> allocation fallback order when all the nodes in a higher tier are
>> out of space: The page allocation can fall back to any node from
>> any lower tier, whereas the demotion order doesn't allow that.
>>
>> The current kernel also don't provide any interfaces for the
>> userspace to learn about the memory tier hierarchy in order to
>> optimize its memory allocations.
>>
>> This patch series address the above by defining memory tiers explicitly.
>>
>> This patch introduce explicity memory tiers. The tier ID value
>> of a memory tier is used to derive the demotion order between
>> NUMA nodes.
>>
>> For example, if we have 3 memtiers: memtier100, memtier200, memiter300
>> then the memory tier order is: memtier300 -> memtier200 -> memtier100
>> where memtier300 is the highest tier and memtier100 is the lowest tier.
>>
>> While reclaim we migrate pages from fast(higher) tiers to slow(lower)
>> tiers when the fast(higher) tier is under memory pressure.
>>
>> This patchset introduce 3 memory tiers (memtier100, memtier200 and memtier300)
>> which are created by different kernel subsystems. The default memory
>> tier created by the kernel is memtier200. A kernel parameter is provided
>> to override the default memory tier.
>>
>> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>
>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  include/linux/memory-tiers.h | 15 +++++++
>>  mm/Makefile                  |  1 +
>>  mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
>>  3 files changed, 94 insertions(+)
>>  create mode 100644 include/linux/memory-tiers.h
>>  create mode 100644 mm/memory-tiers.c
>>
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> new file mode 100644
>> index 000000000000..a81dbc20e0d1
>> --- /dev/null
>> +++ b/include/linux/memory-tiers.h
>> @@ -0,0 +1,15 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_MEMORY_TIERS_H
>> +#define _LINUX_MEMORY_TIERS_H
>> +
>> +#ifdef CONFIG_NUMA
>> +
>> +#define MEMORY_TIER_HBM_GPU	300
>> +#define MEMORY_TIER_DRAM	200
>> +#define MEMORY_TIER_PMEM	100
>> +
>> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
>> +#define MAX_MEMORY_TIER_ID	400
>> +
>> +#endif	/* CONFIG_NUMA */
>> +#endif  /* _LINUX_MEMORY_TIERS_H */
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 6f9ffa968a1a..d30acebc2164 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>>  obj-$(CONFIG_FAILSLAB) += failslab.o
>>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>>  obj-$(CONFIG_MIGRATION) += migrate.o
>> +obj-$(CONFIG_NUMA) += memory-tiers.o
>>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> new file mode 100644
>> index 000000000000..011877b6dbb9
>> --- /dev/null
>> +++ b/mm/memory-tiers.c
>> @@ -0,0 +1,78 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include <linux/types.h>
>> +#include <linux/nodemask.h>
>> +#include <linux/slab.h>
>> +#include <linux/lockdep.h>
>> +#include <linux/moduleparam.h>
>> +#include <linux/memory-tiers.h>
>> +
>> +struct memory_tier {
>> +	struct list_head list;
>> +	int id;
>> +	nodemask_t nodelist;
>> +};
>> +
>> +static DEFINE_MUTEX(memory_tier_lock);
>> +static LIST_HEAD(memory_tiers);
>> +
>> +static void insert_memory_tier(struct memory_tier *memtier)
>> +{
>> +	struct list_head *ent;
>> +	struct memory_tier *tmp_memtier;
>> +
>> +	lockdep_assert_held_once(&memory_tier_lock);
>> +
>> +	list_for_each(ent, &memory_tiers) {
>> +		tmp_memtier = list_entry(ent, struct memory_tier, list);
>> +		if (tmp_memtier->id < memtier->id) {
>> +			list_add_tail(&memtier->list, ent);
>> +			return;
>> +		}
>> +	}
>> +	list_add_tail(&memtier->list, &memory_tiers);
>> +}
>> +
>> +static struct memory_tier *register_memory_tier(unsigned int tier)
>> +{
>> +	struct memory_tier *memtier;
>> +
>> +	if (tier > MAX_MEMORY_TIER_ID)
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
>> +	if (!memtier)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	memtier->id   = tier;
>> +
>> +	insert_memory_tier(memtier);
>> +
>> +	return memtier;
>> +}
>> +
>> +static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
>> +core_param(default_memory_tier, default_memtier, uint, 0644);
>> +
>> +static int __init memory_tier_init(void)
>> +{
>> +	struct memory_tier *memtier;
>> +
>> +	/*
>> +	 * Register only default memory tier to hide all empty
>> +	 * memory tier from sysfs. Since this is early during
>> +	 * boot, we could avoid holding memtory_tier_lock. But
>> +	 * keep it simple by holding locks. So we can add lock
>> +	 * held debug checks in other functions.
>> +	 */
>> +	mutex_lock(&memory_tier_lock);
>> +	memtier = register_memory_tier(default_memtier);
>> +	if (IS_ERR(memtier))
>> +		panic("%s() failed to register memory tier: %ld\n",
>> +		      __func__, PTR_ERR(memtier));
>> +
>> +	/* CPU only nodes are not part of memory tiers. */
>> +	memtier->nodelist = node_states[N_MEMORY];
>> +	mutex_unlock(&memory_tier_lock);
>> +	return 0;
>> +}
>> +subsys_initcall(memory_tier_init);
> 
> You dropped the original sysfs interface patches from the series, but
> the kernel internal implementation is still for the original sysfs
> interface.  For example, memory tier ID is for the original sysfs
> interface, not for the new proposed sysfs interface.  So I suggest you
> to implement with the new interface in mind.  What do you think about
> the following design?
> 

Sorry, I am not able to follow you here. This patchset completely drops
exposing memory tiers to userspace via sysfs. Instead, it allows
creation of memory tiers with a specific tierID from within the kernel/device driver.
The default tierID is 200, and dax kmem creates a memory tier with tierID 100.


> - Each NUMA node belongs to a memory type, and each memory type
>   corresponds to a "abstract distance", so each NUMA node corresonds to
>   a "distance".  For simplicity, we can start with static distances, for
>   example, DRAM (default): 150, PMEM: 250.  The distance of each NUMA
>   node can be recorded in a global array,
> 
>     int node_distances[MAX_NUMNODES];
> 
>   or, just
> 
>     pgdat->distance
> 

I don't follow this. I guess you are proposing a different design.
Would it be easier if you wrote this up in the form of a patch?


> - Each memory tier corresponds to a range of distance, for example,
>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
> 
> - The core API of memory tier could be
> 
>     struct memory_tier *find_create_memory_tier(int distance);
> 
>   it will find the memory tier which covers "distance" in the memory
>   tier list, or create a new memory tier if not found.
> 

I was expecting this to be internal to dax kmem, i.e. how dax kmem maps
"abstract distance" to a memory tier. At this point this patchset is
leaving all that for a future patchset.

> - kmem_dax driver will setup distance for PMEM NUMA nodes before online
>   them.
> 

Sure, can we do that as part of a future patchset?

> - When a NUMA node is onlined, we will use find_create_memory_tier() to
>   find or create its memory tier and add the NUMA node into the memory
>   tier.
> 

This is what this patchset does. When we online a NUMA node, the kernel
finds the memory tier for the node (__node_get_memory_tier). If it doesn't
exist, we create one. (The new one created is not dynamic as you outlined
earlier, but that can be done in a future patchset.) For now I am
keeping this simpler.

static int node_set_memory_tier(int node, int tier)
{
	struct memory_tier *memtier;
	int ret = 0;

	mutex_lock(&memory_tier_lock);
	memtier = __node_get_memory_tier(node);
	/*
	 * If the node is already part of a tier, proceed with the
	 * current tier value, because we might want to establish
	 * new migration paths now. The node might have been added to a
	 * tier before it was made part of N_MEMORY, in which case
	 * establish_migration_targets() will have skipped this node.
	 */
	if (!memtier)
		ret = __node_set_memory_tier(node, tier);
	establish_migration_targets();

	mutex_unlock(&memory_tier_lock);

	return ret;
}
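
The behaviour of node_set_memory_tier() above can be modelled in userspace
roughly as follows.  This is only a sketch: the node_tier[] and
node_has_tier[] arrays and the rebuild counter are hypothetical stand-ins
for __node_get_memory_tier(), __node_set_memory_tier() and
establish_migration_targets(), memory_tier_lock is left out, and the model
always returns 0.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES	8

static int node_tier[MODEL_MAX_NODES];		/* tier ID per node */
static bool node_has_tier[MODEL_MAX_NODES];	/* false until first assignment */
static int migration_rebuilds;			/* times targets were rebuilt */

/* Stand-in for establish_migration_targets(): only counts invocations. */
static void model_establish_migration_targets(void)
{
	migration_rebuilds++;
}

/* Assign 'tier' only if the node has none yet; always rebuild targets. */
int model_node_set_memory_tier(int node, int tier)
{
	assert(node >= 0 && node < MODEL_MAX_NODES);

	if (!node_has_tier[node]) {
		node_tier[node] = tier;
		node_has_tier[node] = true;
	}
	model_establish_migration_targets();
	return 0;
}
```

Calling the model twice for the same node with tiers 100 and then 200 leaves
the node in tier 100, but the migration targets are rebuilt both times.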





> - Or we can add memory type data structure now.
> 
> Best Regards,
> Huang, Ying

-aneesh

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-15  9:08     ` Aneesh Kumar K V
@ 2022-07-15  9:24       ` Aneesh Kumar K V
  2022-07-15 10:27       ` Aneesh Kumar K.V
  2022-07-18  6:57       ` Huang, Ying
  2 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K V @ 2022-07-15  9:24 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

On 7/15/22 2:38 PM, Aneesh Kumar K V wrote:
> On 7/15/22 1:23 PM, Huang, Ying wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>
>>> In the current kernel, memory tiers are defined implicitly via a
>>> demotion path relationship between NUMA nodes, which is created
>>> during the kernel initialization and updated when a NUMA node is
>>> hot-added or hot-removed.  The current implementation puts all
>>> nodes with CPU into the top tier, and builds the tier hierarchy
>>> tier-by-tier by establishing the per-node demotion targets based
>>> on the distances between nodes.
>>>
>>> This current memory tier kernel interface needs to be improved for
>>> several important use cases,
>>>
>>> The current tier initialization code always initializes
>>> each memory-only NUMA node into a lower tier.  But a memory-only
>>> NUMA node may have a high performance memory device (e.g. a DRAM
>>> device attached via CXL.mem or a DRAM-backed memory-only node on
>>> a virtual machine) and should be put into a higher tier.
>>>
>>> The current tier hierarchy always puts CPU nodes into the top
>>> tier. But on a system with HBM or GPU devices, the
>>> memory-only NUMA nodes mapping these devices should be in the
>>> top tier, and DRAM nodes with CPUs are better to be placed into the
>>> next lower tier.
>>>
>>> With current kernel higher tier node can only be demoted to selected nodes on the
>>> next lower tier as defined by the demotion path, not any other
>>> node from any lower tier.  This strict, hard-coded demotion order
>>> does not work in all use cases (e.g. some use cases may want to
>>> allow cross-socket demotion to another node in the same demotion
>>> tier as a fallback when the preferred demotion node is out of
>>> space), This demotion order is also inconsistent with the page
>>> allocation fallback order when all the nodes in a higher tier are
>>> out of space: The page allocation can fall back to any node from
>>> any lower tier, whereas the demotion order doesn't allow that.
>>>
>>> The current kernel also don't provide any interfaces for the
>>> userspace to learn about the memory tier hierarchy in order to
>>> optimize its memory allocations.
>>>
>>> This patch series address the above by defining memory tiers explicitly.
>>>
>>> This patch introduce explicity memory tiers. The tier ID value
>>> of a memory tier is used to derive the demotion order between
>>> NUMA nodes.
>>>
>>> For example, if we have 3 memtiers: memtier100, memtier200, memiter300
>>> then the memory tier order is: memtier300 -> memtier200 -> memtier100
>>> where memtier300 is the highest tier and memtier100 is the lowest tier.
>>>
>>> While reclaim we migrate pages from fast(higher) tiers to slow(lower)
>>> tiers when the fast(higher) tier is under memory pressure.
>>>
>>> This patchset introduce 3 memory tiers (memtier100, memtier200 and memtier300)
>>> which are created by different kernel subsystems. The default memory
>>> tier created by the kernel is memtier200. A kernel parameter is provided
>>> to override the default memory tier.
>>>
>>> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>>> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>
>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>>  include/linux/memory-tiers.h | 15 +++++++
>>>  mm/Makefile                  |  1 +
>>>  mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
>>>  3 files changed, 94 insertions(+)
>>>  create mode 100644 include/linux/memory-tiers.h
>>>  create mode 100644 mm/memory-tiers.c
>>>
>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>> new file mode 100644
>>> index 000000000000..a81dbc20e0d1
>>> --- /dev/null
>>> +++ b/include/linux/memory-tiers.h
>>> @@ -0,0 +1,15 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +#ifndef _LINUX_MEMORY_TIERS_H
>>> +#define _LINUX_MEMORY_TIERS_H
>>> +
>>> +#ifdef CONFIG_NUMA
>>> +
>>> +#define MEMORY_TIER_HBM_GPU	300
>>> +#define MEMORY_TIER_DRAM	200
>>> +#define MEMORY_TIER_PMEM	100
>>> +
>>> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
>>> +#define MAX_MEMORY_TIER_ID	400
>>> +
>>> +#endif	/* CONFIG_NUMA */
>>> +#endif  /* _LINUX_MEMORY_TIERS_H */
>>> diff --git a/mm/Makefile b/mm/Makefile
>>> index 6f9ffa968a1a..d30acebc2164 100644
>>> --- a/mm/Makefile
>>> +++ b/mm/Makefile
>>> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>>>  obj-$(CONFIG_FAILSLAB) += failslab.o
>>>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>>>  obj-$(CONFIG_MIGRATION) += migrate.o
>>> +obj-$(CONFIG_NUMA) += memory-tiers.o
>>>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>> new file mode 100644
>>> index 000000000000..011877b6dbb9
>>> --- /dev/null
>>> +++ b/mm/memory-tiers.c
>>> @@ -0,0 +1,78 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +#include <linux/types.h>
>>> +#include <linux/nodemask.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/lockdep.h>
>>> +#include <linux/moduleparam.h>
>>> +#include <linux/memory-tiers.h>
>>> +
>>> +struct memory_tier {
>>> +	struct list_head list;
>>> +	int id;
>>> +	nodemask_t nodelist;
>>> +};
>>> +
>>> +static DEFINE_MUTEX(memory_tier_lock);
>>> +static LIST_HEAD(memory_tiers);
>>> +
>>> +static void insert_memory_tier(struct memory_tier *memtier)
>>> +{
>>> +	struct list_head *ent;
>>> +	struct memory_tier *tmp_memtier;
>>> +
>>> +	lockdep_assert_held_once(&memory_tier_lock);
>>> +
>>> +	list_for_each(ent, &memory_tiers) {
>>> +		tmp_memtier = list_entry(ent, struct memory_tier, list);
>>> +		if (tmp_memtier->id < memtier->id) {
>>> +			list_add_tail(&memtier->list, ent);
>>> +			return;
>>> +		}
>>> +	}
>>> +	list_add_tail(&memtier->list, &memory_tiers);
>>> +}
>>> +
>>> +static struct memory_tier *register_memory_tier(unsigned int tier)
>>> +{
>>> +	struct memory_tier *memtier;
>>> +
>>> +	if (tier > MAX_MEMORY_TIER_ID)
>>> +		return ERR_PTR(-EINVAL);
>>> +
>>> +	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
>>> +	if (!memtier)
>>> +		return ERR_PTR(-ENOMEM);
>>> +
>>> +	memtier->id   = tier;
>>> +
>>> +	insert_memory_tier(memtier);
>>> +
>>> +	return memtier;
>>> +}
>>> +
>>> +static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
>>> +core_param(default_memory_tier, default_memtier, uint, 0644);
>>> +
>>> +static int __init memory_tier_init(void)
>>> +{
>>> +	struct memory_tier *memtier;
>>> +
>>> +	/*
>>> +	 * Register only default memory tier to hide all empty
>>> +	 * memory tier from sysfs. Since this is early during
>>> +	 * boot, we could avoid holding memtory_tier_lock. But
>>> +	 * keep it simple by holding locks. So we can add lock
>>> +	 * held debug checks in other functions.
>>> +	 */
>>> +	mutex_lock(&memory_tier_lock);
>>> +	memtier = register_memory_tier(default_memtier);
>>> +	if (IS_ERR(memtier))
>>> +		panic("%s() failed to register memory tier: %ld\n",
>>> +		      __func__, PTR_ERR(memtier));
>>> +
>>> +	/* CPU only nodes are not part of memory tiers. */
>>> +	memtier->nodelist = node_states[N_MEMORY];
>>> +	mutex_unlock(&memory_tier_lock);
>>> +	return 0;
>>> +}
>>> +subsys_initcall(memory_tier_init);
>>
>> You dropped the original sysfs interface patches from the series, but
>> the kernel internal implementation is still for the original sysfs
>> interface.  For example, memory tier ID is for the original sysfs
>> interface, not for the new proposed sysfs interface.  So I suggest you
>> to implement with the new interface in mind.  What do you think about
>> the following design?
>>
> 
> Sorry I am not able to follow you here. This patchset completely drops
> exposing memory tiers to userspace via sysfs. Instead it allow
> creation of memory tiers with specific tierID from within the kernel/device driver.
> Default tierID is 200 and dax kmem creates memory tier with tierID 100. 
> 
> 
>> - Each NUMA node belongs to a memory type, and each memory type
>>   corresponds to a "abstract distance", so each NUMA node corresonds to
>>   a "distance".  For simplicity, we can start with static distances, for
>>   example, DRAM (default): 150, PMEM: 250.  The distance of each NUMA
>>   node can be recorded in a global array,
>>
>>     int node_distances[MAX_NUMNODES];
>>
>>   or, just
>>
>>     pgdat->distance
>>
> 
> I don't follow this. I guess you are trying to have a different design.
> Would it be much easier if you can write this in the form of a patch? 
> 
> 
>> - Each memory tier corresponds to a range of distance, for example,
>>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>>
>> - The core API of memory tier could be
>>
>>     struct memory_tier *find_create_memory_tier(int distance);
>>
>>   it will find the memory tier which covers "distance" in the memory
>>   tier list, or create a new memory tier if not found.
>>
> 
> I was expecting this to be internal to dax kmem. How dax kmem maps
> "abstract distance" to a memory tier. At this point this patchset is
> keeping all that for a future patchset. 
> 

At an abstract level, it would look something like this:

modified   drivers/dax/kmem.c
@@ -150,7 +150,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	}
 
 	dev_set_drvdata(dev, data);
-	node_create_and_set_memory_tier(numa_node, dax_kmem_memtier);
+	this_device_tier = find_memtier_from_distance(dev_dax);
+	node_create_and_set_memory_tier(numa_node, this_device_tier);
 	return 0;
 
 err_request_mem:
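
One possible reading of the snippet above, as a standalone sketch: the
device carries an abstract distance, and the driver turns it into a tier ID
before assigning the node.  The trimmed-down struct, its memtier_distance
field, the model_ prefixed names, and the "base tier plus distance" mapping
with a range check are assumptions for illustration, not something this
patchset implements.

```c
#include <assert.h>

#define MEMORY_TIER_PMEM	100	/* value from the patch's memory-tiers.h */
#define MAX_MEMORY_TIER_ID	400	/* value from the patch's memory-tiers.h */

/* Hypothetical, trimmed-down device: only the field the sketch needs. */
struct model_dev_dax {
	int memtier_distance;
};

static unsigned int dax_kmem_memtier = MEMORY_TIER_PMEM;

/* Map a device to a tier ID; return -1 if the ID would be out of range. */
int model_find_memtier_from_distance(const struct model_dev_dax *dev_dax)
{
	long tier = (long)dax_kmem_memtier + dev_dax->memtier_distance;

	if (tier < 0 || tier > MAX_MEMORY_TIER_ID)
		return -1;
	return (int)tier;
}
```

The range check mirrors the tier > MAX_MEMORY_TIER_ID test in
register_memory_tier(), so an out-of-range distance is caught before any
tier is registered.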



>> - kmem_dax driver will setup distance for PMEM NUMA nodes before online
>>   them.
>>
> 
> Sure we can do that as part of future patchset ?
> 
>> - When a NUMA node is onlined, we will use find_create_memory_tier() to
>>   find or create its memory tier and add the NUMA node into the memory
>>   tier.
>>
> 
> This is what this patchset does. When we online a numa node the kernel 
> find the memory tier for the node (__node_get_memory_tier). If it doesn't
> exist, we create one. (The new one created is not dynamic as you outlined
> earlier. But then that can be done in a future patchset). For now I am
> keeping this simpler.
> 
> static int node_set_memory_tier(int node, int tier)
> {
> 	struct memory_tier *memtier;
> 	int ret = 0;
> 
> 	mutex_lock(&memory_tier_lock);
> 	memtier = __node_get_memory_tier(node);
> 	/*
> 	 * if node is already part of the tier proceed with the
> 	 * current tier value, because we might want to establish
> 	 * new migration paths now. The node might be added to a tier
> 	 * before it was made part of N_MEMORY, hence estabilish_migration_targets
> 	 * will have skipped this node.
> 	 */
> 	if (!memtier)
> 		ret = __node_set_memory_tier(node, tier);
> 	establish_migration_targets();
> 
> 	mutex_unlock(&memory_tier_lock);
> 
> 	return ret;
> }
> 
> 
> 
> 
> 
>> - Or we can add memory type data structure now.
>>
>> Best Regards,
>> Huang, Ying
> 
> -aneesh


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-15  9:08     ` Aneesh Kumar K V
  2022-07-15  9:24       ` Aneesh Kumar K V
@ 2022-07-15 10:27       ` Aneesh Kumar K.V
  2022-07-18  6:08         ` Huang, Ying
  2022-07-18  6:57       ` Huang, Ying
  2 siblings, 1 reply; 29+ messages in thread
From: Aneesh Kumar K.V @ 2022-07-15 10:27 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

....

> 
>> You dropped the original sysfs interface patches from the series, but
>> the kernel internal implementation is still for the original sysfs
>> interface.  For example, memory tier ID is for the original sysfs
>> interface, not for the new proposed sysfs interface.  So I suggest you
>> to implement with the new interface in mind.  What do you think about
>> the following design?
>> 
>
> Sorry I am not able to follow you here. This patchset completely drops
> exposing memory tiers to userspace via sysfs. Instead it allow
> creation of memory tiers with specific tierID from within the kernel/device driver.
> Default tierID is 200 and dax kmem creates memory tier with tierID 100. 
>
>
>> - Each NUMA node belongs to a memory type, and each memory type
>>   corresponds to a "abstract distance", so each NUMA node corresonds to
>>   a "distance".  For simplicity, we can start with static distances, for
>>   example, DRAM (default): 150, PMEM: 250.  The distance of each NUMA
>>   node can be recorded in a global array,
>> 
>>     int node_distances[MAX_NUMNODES];
>> 
>>   or, just
>> 
>>     pgdat->distance
>> 
>
> I don't follow this. I guess you are trying to have a different design.
> Would it be much easier if you can write this in the form of a patch? 
>
>
>> - Each memory tier corresponds to a range of distance, for example,
>>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>> 
>> - The core API of memory tier could be
>> 
>>     struct memory_tier *find_create_memory_tier(int distance);
>> 
>>   it will find the memory tier which covers "distance" in the memory
>>   tier list, or create a new memory tier if not found.
>> 
>
> I was expecting this to be internal to dax kmem. How dax kmem maps
> "abstract distance" to a memory tier. At this point this patchset is
> keeping all that for a future patchset. 
>

This shows how I was expecting "abstract distance" to be integrated.

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 82cae08976bc..1281aec63986 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -1332,6 +1332,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 	ndr_desc.mapping = &mapping;
 	ndr_desc.num_mappings = 1;
 	ndr_desc.nd_set = &p->nd_set;
+	ndr_desc.memtier_distance = PMEM_MEMTIER_DEFAULT_DISTANCE;
 
 	if (p->hcall_flush_required) {
 		set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index ae5f4acf2675..7b8cf1f15562 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2641,6 +2641,10 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 			NUMA_NO_NODE, ndr_desc->numa_node, &res.start, &res.end);
 	}
 
+	/*
+	 * We may want to look at SLIT/HMAT to fine tune this
+	 */
+	ndr_desc->memtier_distance  =  PMEM_MEMTIER_DEFAULT_DISTANCE;
 	/*
 	 * Persistence domain bits are hierarchical, if
 	 * ACPI_NFIT_CAPABILITY_CACHE_FLUSH is set then
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1dad813ee4a6..708a40cf29c0 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -570,8 +570,9 @@ static void dax_region_unregister(void *region)
 }
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
-		struct range *range, int target_node, unsigned int align,
-		unsigned long flags)
+				    struct range *range, int target_node,
+				    int memtier_distance, unsigned int align,
+				    unsigned long flags)
 {
 	struct dax_region *dax_region;
 
@@ -599,6 +600,7 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 	dax_region->align = align;
 	dax_region->dev = parent;
 	dax_region->target_node = target_node;
+	dax_region->memtier_distance = memtier_distance;
 	ida_init(&dax_region->ida);
 	dax_region->res = (struct resource) {
 		.start = range->start,
@@ -1370,6 +1372,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 
 	dev_dax->dax_dev = dax_dev;
 	dev_dax->target_node = dax_region->target_node;
+	dev_dax->memtier_distance = dax_region->memtier_distance;
 	dev_dax->align = dax_region->align;
 	ida_init(&dev_dax->ida);
 	kref_get(&dax_region->kref);
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index fbb940293d6d..3de4292392dd 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -13,8 +13,9 @@ void dax_region_put(struct dax_region *dax_region);
 
 #define IORESOURCE_DAX_STATIC (1UL << 0)
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
-		struct range *range, int target_node, unsigned int align,
-		unsigned long flags);
+				    struct range *range, int target_node,
+				    int memtier_distance, unsigned int align,
+				    unsigned long flags);
 
 struct dev_dax_data {
 	struct dax_region *dax_region;
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 1c974b7caae6..5db382c78d0e 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -31,6 +31,7 @@ void dax_bus_exit(void);
 struct dax_region {
 	int id;
 	int target_node;
+	int memtier_distance;
 	struct kref kref;
 	struct device *dev;
 	unsigned int align;
@@ -64,6 +65,7 @@ struct dev_dax {
 	struct dax_device *dax_dev;
 	unsigned int align;
 	int target_node;
+	int memtier_distance;
 	int id;
 	struct ida ida;
 	struct device dev;
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 1bf040dbc834..b9f80971c07b 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -26,7 +26,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
 	range.start = res->start;
 	range.end = res->end;
 	dax_region = alloc_dax_region(dev, pdev->id, &range, mri->target_node,
-			PMD_SIZE, 0);
+				      mri->memtier_distance, PMD_SIZE, 0);
 	if (!dax_region)
 		return -ENOMEM;
 
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 0c03889286ac..32878bd96f09 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -45,13 +45,18 @@ struct dax_kmem_data {
 static unsigned int dax_kmem_memtier = MEMORY_TIER_PMEM;
 module_param(dax_kmem_memtier, uint, 0644);
 
+static int find_memtier_from_distance(struct dev_dax *dev_dax)
+{
+	return dax_kmem_memtier + dev_dax->memtier_distance;
+}
+
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
 	unsigned long total_len = 0;
 	struct dax_kmem_data *data;
 	int i, rc, mapped = 0;
-	int numa_node;
+	int numa_node, mem_tier;
 
 	/*
 	 * Ensure good NUMA information for the persistent memory.
@@ -150,7 +155,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	}
 
 	dev_set_drvdata(dev, data);
-	node_create_and_set_memory_tier(numa_node, dax_kmem_memtier);
+	mem_tier = find_memtier_from_distance(dev_dax);
+	node_create_and_set_memory_tier(numa_node, mem_tier);
 	return 0;
 
 err_request_mem:
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index f050ea78bb83..1b51fc0490de 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -54,8 +54,10 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
 	range = pgmap.range;
 	range.start += offset;
 	dax_region = alloc_dax_region(dev, region_id, &range,
-			nd_region->target_node, le32_to_cpu(pfn_sb->align),
-			IORESOURCE_DAX_STATIC);
+				      nd_region->target_node,
+				      nd_region->memtier_distance,
+				      le32_to_cpu(pfn_sb->align),
+				      IORESOURCE_DAX_STATIC);
 	if (!dax_region)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index ec5219680092..cf7a379a2220 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -416,6 +416,7 @@ struct nd_region {
 	u64 ndr_size;
 	u64 ndr_start;
 	int id, num_lanes, ro, numa_node, target_node;
+	int memtier_distance;
 	void *provider_data;
 	struct kernfs_node *bb_state;
 	struct badblocks bb;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index d976260eca7a..f2067de8d660 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1019,6 +1019,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 	nd_region->ro = ro;
 	nd_region->numa_node = ndr_desc->numa_node;
 	nd_region->target_node = ndr_desc->target_node;
+	nd_region->memtier_distance = ndr_desc->memtier_distance;
 	ida_init(&nd_region->ns_ida);
 	ida_init(&nd_region->btt_ida);
 	ida_init(&nd_region->pfn_ida);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 0d61e07b6827..bf20e018074f 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -121,6 +121,7 @@ struct nd_region_desc {
 	int num_lanes;
 	int numa_node;
 	int target_node;
+	int memtier_distance;
 	unsigned long flags;
 	struct device_node *of_node;
 	int (*flush)(struct nd_region *nd_region, struct bio *bio);
@@ -224,6 +225,8 @@ struct nvdimm_fw_ops {
 	int (*arm)(struct nvdimm *nvdimm, enum nvdimm_fwa_trigger arg);
 };
 
+#define PMEM_MEMTIER_DEFAULT_DISTANCE  10
+
 void badrange_init(struct badrange *badrange);
 int badrange_add(struct badrange *badrange, u64 addr, u64 length);
 void badrange_forget(struct badrange *badrange, phys_addr_t start,
diff --git a/include/linux/memregion.h b/include/linux/memregion.h
index c04c4fd2e209..5850e2bbbfed 100644
--- a/include/linux/memregion.h
+++ b/include/linux/memregion.h
@@ -6,6 +6,7 @@
 
 struct memregion_info {
 	int target_node;
+	int memtier_distance;
 };
 
 #ifdef CONFIG_MEMREGION

* Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-15  7:53   ` Huang, Ying
  2022-07-15  9:08     ` Aneesh Kumar K V
@ 2022-07-15 16:59     ` Wei Xu
  2022-07-18  5:28       ` Huang, Ying
  1 sibling, 1 reply; 29+ messages in thread
From: Wei Xu @ 2022-07-15 16:59 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Aneesh Kumar K.V, Linux MM, Andrew Morton, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner,
	jvgediya.oss, Jagdish Gediya

On Fri, Jul 15, 2022 at 12:53 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
> > In the current kernel, memory tiers are defined implicitly via a
> > demotion path relationship between NUMA nodes, which is created
> > during the kernel initialization and updated when a NUMA node is
> > hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and builds the tier hierarchy
> > tier-by-tier by establishing the per-node demotion targets based
> > on the distances between nodes.
> >
> > This current memory tier kernel interface needs to be improved for
> > several important use cases,
> >
> > The current tier initialization code always initializes
> > each memory-only NUMA node into a lower tier.  But a memory-only
> > NUMA node may have a high performance memory device (e.g. a DRAM
> > device attached via CXL.mem or a DRAM-backed memory-only node on
> > a virtual machine) and should be put into a higher tier.
> >
> > The current tier hierarchy always puts CPU nodes into the top
> > tier. But on a system with HBM or GPU devices, the
> > memory-only NUMA nodes mapping these devices should be in the
> > top tier, and DRAM nodes with CPUs are better to be placed into the
> > next lower tier.
> >
> > With current kernel higher tier node can only be demoted to selected nodes on the
> > next lower tier as defined by the demotion path, not any other
> > node from any lower tier.  This strict, hard-coded demotion order
> > does not work in all use cases (e.g. some use cases may want to
> > allow cross-socket demotion to another node in the same demotion
> > tier as a fallback when the preferred demotion node is out of
> > space), This demotion order is also inconsistent with the page
> > allocation fallback order when all the nodes in a higher tier are
> > out of space: The page allocation can fall back to any node from
> > any lower tier, whereas the demotion order doesn't allow that.
> >
> > The current kernel also don't provide any interfaces for the
> > userspace to learn about the memory tier hierarchy in order to
> > optimize its memory allocations.
> >
> > This patch series address the above by defining memory tiers explicitly.
> >
> > This patch introduce explicity memory tiers. The tier ID value
> > of a memory tier is used to derive the demotion order between
> > NUMA nodes.
> >
> > For example, if we have 3 memtiers: memtier100, memtier200, memiter300
> > then the memory tier order is: memtier300 -> memtier200 -> memtier100
> > where memtier300 is the highest tier and memtier100 is the lowest tier.
> >
> > While reclaim we migrate pages from fast(higher) tiers to slow(lower)
> > tiers when the fast(higher) tier is under memory pressure.
> >
> > This patchset introduce 3 memory tiers (memtier100, memtier200 and memtier300)
> > which are created by different kernel subsystems. The default memory
> > tier created by the kernel is memtier200. A kernel parameter is provided
> > to override the default memory tier.
> >
> > Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> > Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
> >
> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > ---
> >  include/linux/memory-tiers.h | 15 +++++++
> >  mm/Makefile                  |  1 +
> >  mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
> >  3 files changed, 94 insertions(+)
> >  create mode 100644 include/linux/memory-tiers.h
> >  create mode 100644 mm/memory-tiers.c
> >
> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > new file mode 100644
> > index 000000000000..a81dbc20e0d1
> > --- /dev/null
> > +++ b/include/linux/memory-tiers.h
> > @@ -0,0 +1,15 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_MEMORY_TIERS_H
> > +#define _LINUX_MEMORY_TIERS_H
> > +
> > +#ifdef CONFIG_NUMA
> > +
> > +#define MEMORY_TIER_HBM_GPU  300
> > +#define MEMORY_TIER_DRAM     200
> > +#define MEMORY_TIER_PMEM     100
> > +
> > +#define DEFAULT_MEMORY_TIER  MEMORY_TIER_DRAM
> > +#define MAX_MEMORY_TIER_ID   400
> > +
> > +#endif       /* CONFIG_NUMA */
> > +#endif  /* _LINUX_MEMORY_TIERS_H */
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 6f9ffa968a1a..d30acebc2164 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
> >  obj-$(CONFIG_FAILSLAB) += failslab.o
> >  obj-$(CONFIG_MEMTEST)                += memtest.o
> >  obj-$(CONFIG_MIGRATION) += migrate.o
> > +obj-$(CONFIG_NUMA) += memory-tiers.o
> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > new file mode 100644
> > index 000000000000..011877b6dbb9
> > --- /dev/null
> > +++ b/mm/memory-tiers.c
> > @@ -0,0 +1,78 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/types.h>
> > +#include <linux/nodemask.h>
> > +#include <linux/slab.h>
> > +#include <linux/lockdep.h>
> > +#include <linux/moduleparam.h>
> > +#include <linux/memory-tiers.h>
> > +
> > +struct memory_tier {
> > +     struct list_head list;
> > +     int id;
> > +     nodemask_t nodelist;
> > +};
> > +
> > +static DEFINE_MUTEX(memory_tier_lock);
> > +static LIST_HEAD(memory_tiers);
> > +
> > +static void insert_memory_tier(struct memory_tier *memtier)
> > +{
> > +     struct list_head *ent;
> > +     struct memory_tier *tmp_memtier;
> > +
> > +     lockdep_assert_held_once(&memory_tier_lock);
> > +
> > +     list_for_each(ent, &memory_tiers) {
> > +             tmp_memtier = list_entry(ent, struct memory_tier, list);
> > +             if (tmp_memtier->id < memtier->id) {
> > +                     list_add_tail(&memtier->list, ent);
> > +                     return;
> > +             }
> > +     }
> > +     list_add_tail(&memtier->list, &memory_tiers);
> > +}
> > +
> > +static struct memory_tier *register_memory_tier(unsigned int tier)
> > +{
> > +     struct memory_tier *memtier;
> > +
> > +     if (tier > MAX_MEMORY_TIER_ID)
> > +             return ERR_PTR(-EINVAL);
> > +
> > +     memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> > +     if (!memtier)
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     memtier->id   = tier;
> > +
> > +     insert_memory_tier(memtier);
> > +
> > +     return memtier;
> > +}
> > +
> > +static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
> > +core_param(default_memory_tier, default_memtier, uint, 0644);
> > +
> > +static int __init memory_tier_init(void)
> > +{
> > +     struct memory_tier *memtier;
> > +
> > +     /*
> > +      * Register only default memory tier to hide all empty
> > +      * memory tier from sysfs. Since this is early during
> > +      * boot, we could avoid holding memtory_tier_lock. But
> > +      * keep it simple by holding locks. So we can add lock
> > +      * held debug checks in other functions.
> > +      */
> > +     mutex_lock(&memory_tier_lock);
> > +     memtier = register_memory_tier(default_memtier);
> > +     if (IS_ERR(memtier))
> > +             panic("%s() failed to register memory tier: %ld\n",
> > +                   __func__, PTR_ERR(memtier));
> > +
> > +     /* CPU only nodes are not part of memory tiers. */
> > +     memtier->nodelist = node_states[N_MEMORY];
> > +     mutex_unlock(&memory_tier_lock);
> > +     return 0;
> > +}
> > +subsys_initcall(memory_tier_init);
>
> You dropped the original sysfs interface patches from the series, but
> the kernel internal implementation is still for the original sysfs
> interface.  For example, memory tier ID is for the original sysfs
> interface, not for the new proposed sysfs interface.  So I suggest you
> to implement with the new interface in mind.  What do you think about
> the following design?
>
> - Each NUMA node belongs to a memory type, and each memory type
>   corresponds to a "abstract distance", so each NUMA node corresonds to
>   a "distance".  For simplicity, we can start with static distances, for
>   example, DRAM (default): 150, PMEM: 250.

I agree with this design, though I'd prefer that the new attribute not
be named "distance", both to avoid confusion with the SLIT distance
and to avoid the misconception that only latency matters and
bandwidth doesn't.

How about we call it "performance level" (perf_level) or something
similar instead?

> The distance of each NUMA
>   node can be recorded in a global array,
>
>     int node_distances[MAX_NUMNODES];
>
>   or, just
>
>     pgdat->distance

I think node_devices[] is a better place to record this new attribute.
The HMAT performance data is also listed there.

> - Each memory tier corresponds to a range of distance, for example,
>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>
> - The core API of memory tier could be
>
>     struct memory_tier *find_create_memory_tier(int distance);
>
>   it will find the memory tier which covers "distance" in the memory
>   tier list, or create a new memory tier if not found.
>
> - kmem_dax driver will setup distance for PMEM NUMA nodes before online
>   them.

This attribute should be a property of the NUMA node based on the
device hardware.  For PMEM, it is better to handle this at the ACPI
level.  For example, we can consider initializing this attribute for a
PMEM node in acpi_numa_memory_affinity_init() when the node is
non-volatile.

> - When a NUMA node is onlined, we will use find_create_memory_tier() to
>   find or create its memory tier and add the NUMA node into the memory
>   tier.

I think we should create all the memory tiers up-front, just like NUMA
nodes, to keep their devices and IDs stable.  Similar to offline NUMA
nodes, when a memory tier has no online nodes, we can mark it as
offline and exclude it from online-related operations (e.g. demotion).
A memory tier can be made online once an online node is assigned to
it.
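
As a rough illustration of the up-front approach (a user-space sketch,
not kernel code): the tier table is static so IDs stay stable, and a
tier counts as online only while it has nodes assigned.  The names
struct tier, tier_for_distance() and tier_online(), the four-entry
table, and the half-open reading of the ranges quoted above (0-100,
100-200, 200-300, >300) are all assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_TIERS 4	/* assumption: one tier per quoted range */

/* Hypothetical tier record: created once at init, never freed. */
struct tier {
	int id;		/* stable ID, fixed up-front */
	int nr_nodes;	/* online nodes currently assigned */
};

static struct tier tiers[NR_TIERS] = {
	{ .id = 0 }, { .id = 1 }, { .id = 2 }, { .id = 3 },
};

/*
 * Map a distance to its pre-created tier, assuming half-open ranges
 * [0,100), [100,200), [200,300) and [300,...).
 */
static struct tier *tier_for_distance(int distance)
{
	int idx;

	if (distance < 0)
		return NULL;
	idx = distance / 100;
	if (idx >= NR_TIERS)
		idx = NR_TIERS - 1;
	return &tiers[idx];
}

/* A tier with no nodes still exists; it is just treated as offline. */
static bool tier_online(const struct tier *t)
{
	return t->nr_nodes > 0;
}

static void tier_add_node(struct tier *t)
{
	t->nr_nodes++;
}

static void tier_remove_node(struct tier *t)
{
	assert(t->nr_nodes > 0);
	t->nr_nodes--;
}
```

With this shape, demotion-target selection would skip any tier for
which tier_online() is false, while the tier itself (and whatever
device or ID is attached to it) stays in place.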

> - Or we can add memory type data structure now.
>
> Best Regards,
> Huang, Ying

* Re: [PATCH v9 6/8] mm/demotion: Add pg_data_t member to track node memory tier details
  2022-07-15  7:19     ` Aneesh Kumar K.V
@ 2022-07-18  5:22       ` Alistair Popple
  0 siblings, 0 replies; 29+ messages in thread
From: Alistair Popple @ 2022-07-18  5:22 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Dan Williams,
	Johannes Weiner, jvgediya.oss


"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>
>>> Also update different helpes to use NODE_DATA()->memtier. Since
>>> node specific memtier can change based on the reassignment of
>>> NUMA node to a different memory tiers, accessing NODE_DATA()->memtier
>>> needs to happen under an rcu read lock or memory_tier_lock.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>>  include/linux/mmzone.h |  3 ++
>>>  mm/memory-tiers.c      | 64 +++++++++++++++++++++++++++++++-----------
>>>  2 files changed, 50 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index aab70355d64f..353812495a70 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -928,6 +928,9 @@ typedef struct pglist_data {
>>>  	/* Per-node vmstats */
>>>  	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
>>>  	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
>>> +#ifdef CONFIG_NUMA
>>> +	struct memory_tier __rcu *memtier;
>>> +#endif
>>>  } pg_data_t;
>>>
>>>  #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>> index e951f54ce56c..bab4700bf58d 100644
>>> --- a/mm/memory-tiers.c
>>> +++ b/mm/memory-tiers.c
>>> @@ -7,6 +7,7 @@
>>>  #include <linux/moduleparam.h>
>>>  #include <linux/memory.h>
>>>  #include <linux/random.h>
>>> +#include <linux/rcupdate.h>
>>>  #include <linux/memory-tiers.h>
>>>
>>>  #include "internal.h"
>>> @@ -124,18 +125,23 @@ static struct memory_tier *register_memory_tier(unsigned int tier)
>>>  static void unregister_memory_tier(struct memory_tier *memtier)
>>>  {
>>>  	list_del(&memtier->list);
>>> -	kfree(memtier);
>>> +	kfree_rcu(memtier);
>>>  }
>>>
>>>  static struct memory_tier *__node_get_memory_tier(int node)
>>>  {
>>> -	struct memory_tier *memtier;
>>> +	pg_data_t *pgdat;
>>>
>>> -	list_for_each_entry(memtier, &memory_tiers, list) {
>>> -		if (node_isset(node, memtier->nodelist))
>>> -			return memtier;
>>> -	}
>>> -	return NULL;
>>> +	pgdat = NODE_DATA(node);
>>> +	if (!pgdat)
>>> +		return NULL;
>>> +	/*
>>> +	 * Since we hold memory_tier_lock, we can avoid
>>> +	 * RCU read locks when accessing the details. No
>>> +	 * parallel updates are possible here.
>>> +	 */
>>> +	return rcu_dereference_check(pgdat->memtier,
>>> +				     lockdep_is_held(&memory_tier_lock));
>>>  }
>>>
>>>  static struct memory_tier *__get_memory_tier_from_id(int id)
>>> @@ -149,6 +155,33 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
>>>  	return NULL;
>>>  }
>>>
>>> +/*
>>> + * Called with memory_tier_lock. Hence the device references cannot
>>> + * be dropped during this function.
>>> + */
>>> +static void memtier_node_set(int node, struct memory_tier *memtier)
>>> +{
>>> +	pg_data_t *pgdat;
>>> +	struct memory_tier *current_memtier;
>>> +
>>> +	pgdat = NODE_DATA(node);
>>> +	if (!pgdat)
>>> +		return;
>>> +	/*
>>> +	 * Make sure we mark the memtier NULL before we assign the new memory tier
>>> +	 * to the NUMA node. This make sure that anybody looking at NODE_DATA
>>> +	 * finds a NULL memtier or the one which is still valid.
>>> +	 */
>>> +	current_memtier = rcu_dereference_check(pgdat->memtier,
>>> +						lockdep_is_held(&memory_tier_lock));
>>> +	rcu_assign_pointer(pgdat->memtier, NULL);
>>> +	if (current_memtier)
>>> +		node_clear(node, current_memtier->nodelist);
>>
>> It seems odd to me that you would update the current memtier prior to
>> the synchronize_rcu(). I suppose it's really memory_tier_lock that
>> protects the details like ->nodelist, but is there any reason not do the
>> update after anyway?
>
> The synchronize_rcu ensures that the lockless read of pgdat->memtier
> either see value NULL or a stable memtier which got current numa node in
> its nodelist. IIUC what you are suggesting is we should move the
> node_clear after synchronize_rcu?. I am also wondering whether I need
> a smp_wmb()?

rcu_assign_pointer() includes the appropriate barriers to ensure any
initialisation will be visible, so I don't believe you need smp_wmb().

> pgdat->memtier = NULL;
> synchronize_rcu
> remove node from memtier;
> set node in new memtier
> smp_wmb();
> pgdat->memtier = new memtier;

Yeah, that's what I was suggesting. Although to be clear I don't think
there was actually a correctness issue with what you had, because
memtier->nodelist is protected by memory_tier_lock anyway and not
accessed under the rcu_read_lock().

It just looked a little odd IMHO to be updating something that was still
potentially being used prior to synchronize_rcu() completing.
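
To make the suggested ordering concrete, here is a user-space analogue
using C11 atomics rather than the kernel RCU API.  It is only a sketch:
node_tier[], node_move_tier(), node_get_tier() and wait_for_readers()
are made-up names, wait_for_readers() is an empty placeholder for where
synchronize_rcu() would sit, and the release/acquire pair is a
conservative stand-in for rcu_assign_pointer()/rcu_dereference().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8

struct tier {
	int id;
	bool nodes[MAX_NODES];	/* stand-in for nodemask_t nodelist */
};

/* Stand-in for pgdat->memtier: one published pointer per node. */
static _Atomic(struct tier *) node_tier[MAX_NODES];

/*
 * Hypothetical placeholder.  A real implementation must block here
 * until every reader that might hold the old pointer has finished.
 */
static void wait_for_readers(void)
{
}

/* Writer side; the caller is assumed to hold the lock guarding ->nodes. */
static void node_move_tier(int node, struct tier *new_tier)
{
	struct tier *old;

	assert(node >= 0 && node < MAX_NODES);
	old = atomic_load_explicit(&node_tier[node], memory_order_relaxed);

	/* 1. Unpublish: new readers see NULL from here on. */
	atomic_store_explicit(&node_tier[node], NULL, memory_order_release);
	/* 2. Wait until nobody can still be using the old pointer. */
	wait_for_readers();
	/* 3. Only now touch the membership of the old and new tiers. */
	if (old)
		old->nodes[node] = false;
	new_tier->nodes[node] = true;
	/* 4. Publish; the release store orders step 3 before it. */
	atomic_store_explicit(&node_tier[node], new_tier,
			      memory_order_release);
}

/* Reader side: sees either NULL or a fully updated tier. */
static struct tier *node_get_tier(int node)
{
	return atomic_load_explicit(&node_tier[node], memory_order_acquire);
}
```

The release store in step 4 is what makes a separate write barrier
unnecessary in this analogue.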

>>
>>> +	synchronize_rcu();
>>> +	node_set(node, memtier->nodelist);
>>> +	rcu_assign_pointer(pgdat->memtier, memtier);
>>> +}
>>> +
>>>  static int __node_create_and_set_memory_tier(int node, int tier)
>>>  {
>>>  	int ret = 0;
>>> @@ -162,7 +195,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
>>>  			goto out;
>>>  		}
>>>  	}
>>> -	node_set(node, memtier->nodelist);
>>> +	memtier_node_set(node, memtier);
>>>  out:
>>>  	return ret;
>>>  }
>>> @@ -184,14 +217,7 @@ int node_create_and_set_memory_tier(int node, int tier)
>>>  	if (current_tier->id == tier)
>>>  		goto out;
>>>
>>> -	node_clear(node, current_tier->nodelist);
>>> -
>>>  	ret = __node_create_and_set_memory_tier(node, tier);
>>> -	if (ret) {
>>> -		/* reset it back to older tier */
>>> -		node_set(node, current_tier->nodelist);
>>> -		goto out;
>>> -	}
>>>  	if (nodes_empty(current_tier->nodelist))
>>>  		unregister_memory_tier(current_tier);
>>>
>>> @@ -213,7 +239,7 @@ static int __node_set_memory_tier(int node, int tier)
>>>  		ret = -EINVAL;
>>>  		goto out;
>>>  	}
>>> -	node_set(node, memtier->nodelist);
>>> +	memtier_node_set(node, memtier);
>>>  out:
>>>  	return ret;
>>>  }
>>> @@ -428,6 +454,7 @@ static void __init migrate_on_reclaim_init(void)
>>>
>>>  static int __init memory_tier_init(void)
>>>  {
>>> +	int node;
>>>  	struct memory_tier *memtier;
>>>
>>>  	/*
>>> @@ -444,7 +471,10 @@ static int __init memory_tier_init(void)
>>>  		      __func__, PTR_ERR(memtier));
>>>
>>>  	/* CPU only nodes are not part of memory tiers. */
>>> -	memtier->nodelist = node_states[N_MEMORY];
>>> +	for_each_node_state(node, N_MEMORY) {
>>> +		rcu_assign_pointer(NODE_DATA(node)->memtier, memtier);
>>> +		node_set(node, memtier->nodelist);
>>
>> Similar comment here - the order seems opposite to what I'd expect.
>> Shouldn't memtier->nodelist be fully initialised prior to making it
>> visible with rcu_assign_pointer()?
>
> Will fix this. This is early during boot. So the ordering won't impact
> correctness. Hence i can skip the smp_wmb()?

Yeah, rcu_assign_pointer() should include appropriate barriers anyway.

>>
>>> +	}
>>>  	mutex_unlock(&memory_tier_lock);
>>>
>>>  	migrate_on_reclaim_init();

* Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-15 16:59     ` Wei Xu
@ 2022-07-18  5:28       ` Huang, Ying
  2022-07-18  5:58         ` Alistair Popple
  0 siblings, 1 reply; 29+ messages in thread
From: Huang, Ying @ 2022-07-18  5:28 UTC (permalink / raw)
  To: Wei Xu
  Cc: Aneesh Kumar K.V, Linux MM, Andrew Morton, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner,
	jvgediya.oss, Jagdish Gediya

Wei Xu <weixugc@google.com> writes:

> On Fri, Jul 15, 2022 at 12:53 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>
>> > In the current kernel, memory tiers are defined implicitly via a
>> > demotion path relationship between NUMA nodes, which is created
>> > during the kernel initialization and updated when a NUMA node is
>> > hot-added or hot-removed.  The current implementation puts all
>> > nodes with CPU into the top tier, and builds the tier hierarchy
>> > tier-by-tier by establishing the per-node demotion targets based
>> > on the distances between nodes.
>> >
>> > This current memory tier kernel interface needs to be improved for
>> > several important use cases,
>> >
>> > The current tier initialization code always initializes
>> > each memory-only NUMA node into a lower tier.  But a memory-only
>> > NUMA node may have a high performance memory device (e.g. a DRAM
>> > device attached via CXL.mem or a DRAM-backed memory-only node on
>> > a virtual machine) and should be put into a higher tier.
>> >
>> > The current tier hierarchy always puts CPU nodes into the top
>> > tier. But on a system with HBM or GPU devices, the
>> > memory-only NUMA nodes mapping these devices should be in the
>> > top tier, and DRAM nodes with CPUs are better to be placed into the
>> > next lower tier.
>> >
>> > With current kernel higher tier node can only be demoted to selected nodes on the
>> > next lower tier as defined by the demotion path, not any other
>> > node from any lower tier.  This strict, hard-coded demotion order
>> > does not work in all use cases (e.g. some use cases may want to
>> > allow cross-socket demotion to another node in the same demotion
>> > tier as a fallback when the preferred demotion node is out of
>> > space), This demotion order is also inconsistent with the page
>> > allocation fallback order when all the nodes in a higher tier are
>> > out of space: The page allocation can fall back to any node from
>> > any lower tier, whereas the demotion order doesn't allow that.
>> >
>> > The current kernel also don't provide any interfaces for the
>> > userspace to learn about the memory tier hierarchy in order to
>> > optimize its memory allocations.
>> >
>> > This patch series address the above by defining memory tiers explicitly.
>> >
>> > This patch introduce explicity memory tiers. The tier ID value
>> > of a memory tier is used to derive the demotion order between
>> > NUMA nodes.
>> >
>> > For example, if we have 3 memtiers: memtier100, memtier200, memiter300
>> > then the memory tier order is: memtier300 -> memtier200 -> memtier100
>> > where memtier300 is the highest tier and memtier100 is the lowest tier.
>> >
>> > While reclaim we migrate pages from fast(higher) tiers to slow(lower)
>> > tiers when the fast(higher) tier is under memory pressure.
>> >
>> > This patchset introduce 3 memory tiers (memtier100, memtier200 and memtier300)
>> > which are created by different kernel subsystems. The default memory
>> > tier created by the kernel is memtier200. A kernel parameter is provided
>> > to override the default memory tier.
>> >
>> > Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>> > Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>> >
>> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> > ---
>> >  include/linux/memory-tiers.h | 15 +++++++
>> >  mm/Makefile                  |  1 +
>> >  mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
>> >  3 files changed, 94 insertions(+)
>> >  create mode 100644 include/linux/memory-tiers.h
>> >  create mode 100644 mm/memory-tiers.c
>> >
>> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> > new file mode 100644
>> > index 000000000000..a81dbc20e0d1
>> > --- /dev/null
>> > +++ b/include/linux/memory-tiers.h
>> > @@ -0,0 +1,15 @@
>> > +/* SPDX-License-Identifier: GPL-2.0 */
>> > +#ifndef _LINUX_MEMORY_TIERS_H
>> > +#define _LINUX_MEMORY_TIERS_H
>> > +
>> > +#ifdef CONFIG_NUMA
>> > +
>> > +#define MEMORY_TIER_HBM_GPU  300
>> > +#define MEMORY_TIER_DRAM     200
>> > +#define MEMORY_TIER_PMEM     100
>> > +
>> > +#define DEFAULT_MEMORY_TIER  MEMORY_TIER_DRAM
>> > +#define MAX_MEMORY_TIER_ID   400
>> > +
>> > +#endif       /* CONFIG_NUMA */
>> > +#endif  /* _LINUX_MEMORY_TIERS_H */
>> > diff --git a/mm/Makefile b/mm/Makefile
>> > index 6f9ffa968a1a..d30acebc2164 100644
>> > --- a/mm/Makefile
>> > +++ b/mm/Makefile
>> > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>> >  obj-$(CONFIG_FAILSLAB) += failslab.o
>> >  obj-$(CONFIG_MEMTEST)                += memtest.o
>> >  obj-$(CONFIG_MIGRATION) += migrate.o
>> > +obj-$(CONFIG_NUMA) += memory-tiers.o
>> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> > new file mode 100644
>> > index 000000000000..011877b6dbb9
>> > --- /dev/null
>> > +++ b/mm/memory-tiers.c
>> > @@ -0,0 +1,78 @@
>> > +// SPDX-License-Identifier: GPL-2.0
>> > +#include <linux/types.h>
>> > +#include <linux/nodemask.h>
>> > +#include <linux/slab.h>
>> > +#include <linux/lockdep.h>
>> > +#include <linux/moduleparam.h>
>> > +#include <linux/memory-tiers.h>
>> > +
>> > +struct memory_tier {
>> > +     struct list_head list;
>> > +     int id;
>> > +     nodemask_t nodelist;
>> > +};
>> > +
>> > +static DEFINE_MUTEX(memory_tier_lock);
>> > +static LIST_HEAD(memory_tiers);
>> > +
>> > +static void insert_memory_tier(struct memory_tier *memtier)
>> > +{
>> > +     struct list_head *ent;
>> > +     struct memory_tier *tmp_memtier;
>> > +
>> > +     lockdep_assert_held_once(&memory_tier_lock);
>> > +
>> > +     list_for_each(ent, &memory_tiers) {
>> > +             tmp_memtier = list_entry(ent, struct memory_tier, list);
>> > +             if (tmp_memtier->id < memtier->id) {
>> > +                     list_add_tail(&memtier->list, ent);
>> > +                     return;
>> > +             }
>> > +     }
>> > +     list_add_tail(&memtier->list, &memory_tiers);
>> > +}
>> > +
>> > +static struct memory_tier *register_memory_tier(unsigned int tier)
>> > +{
>> > +     struct memory_tier *memtier;
>> > +
>> > +     if (tier > MAX_MEMORY_TIER_ID)
>> > +             return ERR_PTR(-EINVAL);
>> > +
>> > +     memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
>> > +     if (!memtier)
>> > +             return ERR_PTR(-ENOMEM);
>> > +
>> > +     memtier->id   = tier;
>> > +
>> > +     insert_memory_tier(memtier);
>> > +
>> > +     return memtier;
>> > +}
>> > +
>> > +static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
>> > +core_param(default_memory_tier, default_memtier, uint, 0644);
>> > +
>> > +static int __init memory_tier_init(void)
>> > +{
>> > +     struct memory_tier *memtier;
>> > +
>> > +     /*
>> > +      * Register only default memory tier to hide all empty
>> > +      * memory tier from sysfs. Since this is early during
>> > +      * boot, we could avoid holding memtory_tier_lock. But
>> > +      * keep it simple by holding locks. So we can add lock
>> > +      * held debug checks in other functions.
>> > +      */
>> > +     mutex_lock(&memory_tier_lock);
>> > +     memtier = register_memory_tier(default_memtier);
>> > +     if (IS_ERR(memtier))
>> > +             panic("%s() failed to register memory tier: %ld\n",
>> > +                   __func__, PTR_ERR(memtier));
>> > +
>> > +     /* CPU only nodes are not part of memory tiers. */
>> > +     memtier->nodelist = node_states[N_MEMORY];
>> > +     mutex_unlock(&memory_tier_lock);
>> > +     return 0;
>> > +}
>> > +subsys_initcall(memory_tier_init);
>>
>> You dropped the original sysfs interface patches from the series, but
>> the kernel internal implementation is still for the original sysfs
>> interface.  For example, memory tier ID is for the original sysfs
>> interface, not for the new proposed sysfs interface.  So I suggest you
>> to implement with the new interface in mind.  What do you think about
>> the following design?
>>
>> - Each NUMA node belongs to a memory type, and each memory type
>>   corresponds to a "abstract distance", so each NUMA node corresonds to
>>   a "distance".  For simplicity, we can start with static distances, for
>>   example, DRAM (default): 150, PMEM: 250.
>
> I agree with this design, though I'd prefer the new attribute to not
> be named as "distance".  This is to both avoid the confusion with the
> SLIT distance and to avoid the misconception that only the latency
> matters, but the bandwidth doesn't.
>
> How about we call it "performance level" (perf_level) or something
> similar instead?

I have no strong opinion on this.  Either "distance" or "perf_level"
looks OK to me.

>> The distance of each NUMA
>>   node can be recorded in a global array,
>>
>>     int node_distances[MAX_NUMNODES];
>>
>>   or, just
>>
>>     pgdat->distance
>
> I think node_devices[] is a better place to record this new attribute.
> The HMAT performance data is also listed there.

Firstly, we all agree that we need a place to record this information,
per node or per memory type.  Personally, I prefer to separate the data
and its interface (such as sysfs).

>> - Each memory tier corresponds to a range of distance, for example,
>>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>>
>> - The core API of memory tier could be
>>
>>     struct memory_tier *find_create_memory_tier(int distance);
>>
>>   it will find the memory tier which covers "distance" in the memory
>>   tier list, or create a new memory tier if not found.
>>
>> - kmem_dax driver will setup distance for PMEM NUMA nodes before online
>>   them.
>
> This attribute should be a property of the NUMA node based on the
> device hardware.

Yes.  Or a property of a memory type.

> For PMEM, it is better to handle at the ACPI level.
> For example, we can consider initializing this attribute for a PMEM
> node in acpi_numa_memory_affinity_init() when the node is
> non-volatile.

The abstract_distance/perf_level may be determined from multiple
information sources, e.g., ACPI SLIT/SRAT/HMAT, etc.  It should be the
responsibility of device drivers (e.g., kmem_dax) to determine the final
value of abstract_distance/perf_level based on which information is
available, its priority, and specific knowledge of the hardware.  Yes,
ACPI SRAT is valuable for determining the abstract_distance/perf_level,
and it's better for kmem_dax to use it when determining the final value
of abstract_distance/perf_level.

To make the first version as simple as possible, I think we can just use
some static abstract_distance/perf_level in the kmem_dax driver for the
NUMA nodes onlined by it, because we use the driver for PMEM only now.
We can enhance the implementation later.

>> - When a NUMA node is onlined, we will use find_create_memory_tier() to
>>   find or create its memory tier and add the NUMA node into the memory
>>   tier.
>
> I think we should create all the memory tiers up-front, just like NUMA
> nodes, to keep their devices and IDs stable.  Similar to offline NUMA
> nodes, when a memory tier has no online nodes, we can mark it as
> offline and exclude it from online-related operations (e.g. demotion).
> A memory tier can be made online when it gets assigned with an online
> node.

Each memory tier corresponds to a range of abstract_distance/perf_level.
For example, if 1 <= abstract_distance/perf_level <= 500, 5 memory tiers
can be defined with abstract_distance/perf_level ranges 1-100, 101-200,
201-300, 301-400, 401-500.  We can create these 5 memory tiers up-front
of course.  In the new design, we may change the ranges at run time
according to the policy chosen by the users.  For example, we may change
the 5 memory tiers above to 500 memory tiers, with
abstract_distance/perf_level ranges 1-1, 2-2, ..., 500-500.  This may
make memory tier devices and their IDs unstable to some degree.  But if
we are careful when customizing the ranges, it's possible to keep the
memory tier devices and their IDs stable in most cases.

Because we may define 500 memory tiers, it's really hard to create all
memory tiers up-front.  But we can create them all in concept and
allocate memory/resources for one when we add the first NUMA node to it.

To make the first version as simple as possible, I suggest defining 500
memory tiers as above statically.
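
As a rough illustration of "create them all in concept, allocate on
first use", here is a userspace C model, not kernel code.  It uses the
five-range example (1-100, ..., 401-500) from above; the constants, the
tiers[] array and distance_to_tier_id() are hypothetical names used only
for this sketch, and find_create_memory_tier() simply follows the
signature proposed earlier in this thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical values taken from the example ranges in the text. */
#define MAX_ABSTRACT_DISTANCE 500
#define TIER_RANGE_SIZE       100  /* 1-100, 101-200, ..., 401-500 */
#define NR_TIERS (MAX_ABSTRACT_DISTANCE / TIER_RANGE_SIZE)

struct memory_tier {
	int id;        /* 0-based index of the distance range */
	int nr_nodes;  /* NUMA nodes currently placed in this tier */
};

/* Every tier exists "in concept"; a slot stays NULL until first use. */
static struct memory_tier *tiers[NR_TIERS];

/* Map a distance in [1, MAX_ABSTRACT_DISTANCE] to its range index. */
static int distance_to_tier_id(int distance)
{
	assert(distance >= 1 && distance <= MAX_ABSTRACT_DISTANCE);
	return (distance - 1) / TIER_RANGE_SIZE;
}

/* Find the tier covering "distance", allocating it on first use. */
static struct memory_tier *find_create_memory_tier(int distance)
{
	int id = distance_to_tier_id(distance);

	if (!tiers[id]) {
		tiers[id] = calloc(1, sizeof(*tiers[id]));
		if (!tiers[id])
			return NULL;
		tiers[id]->id = id;
	}
	return tiers[id];
}
```

With the static distances suggested earlier, DRAM (150) and PMEM (250)
land in two different ranges, and no resources are allocated for the
ranges that have no node.  Switching to 500 single-value ranges would
only mean setting TIER_RANGE_SIZE to 1 in this model.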

>> - Or we can add memory type data structure now.

Best Regards,
Huang, Ying

* Re: [PATCH v9 5/8] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-07-15  7:21     ` Aneesh Kumar K.V
@ 2022-07-18  5:41       ` Alistair Popple
  0 siblings, 0 replies; 29+ messages in thread
From: Alistair Popple @ 2022-07-18  5:41 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Dan Williams,
	Johannes Weiner, jvgediya.oss


"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
> ....
>
>> + */
>>> +static void establish_migration_targets(void)
>>> +{
>>> +	struct memory_tier *memtier;
>>> +	struct demotion_nodes *nd;
>>> +	int target = NUMA_NO_NODE, node;
>>> +	int distance, best_distance;
>>> +	nodemask_t used;
>>> +
>>> +	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
>>
>> Does it make sense to include the memory tiering/demotion code if
>> CONFIG_MIGRATION isn't enabled? From what I can tell none of the
>> information established here is used if CONFIG_MIGRATION isn't enabled,
>> so it would be better to remove the IS_ENABLED checks and not include
>> the code at all.
>
> We use the same function/codepath for updating top_tier details. We
> would want to get node_is_toptier() to work even with CONFIG_MIGRATION
> disabled?

Why though? As far as I can tell node_is_toptier() only makes a
difference if CONFIG_MIGRATION is enabled, so it could just return a
static value if CONFIG_MIGRATION isn't enabled.
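
Something along these lines is what I have in mind, written as a
standalone userspace sketch rather than a patch.  The toptier_nodes[]
array and the small MAX_NUMNODES value are made up for illustration, and
returning true in the stub is an assumption; which constant is right
depends on what the callers expect.

```c
#include <assert.h>
#include <stdbool.h>

/* Uncomment to model a kernel built with CONFIG_MIGRATION=y. */
/* #define CONFIG_MIGRATION 1 */

#define MAX_NUMNODES 8  /* hypothetical small value for illustration */

#ifdef CONFIG_MIGRATION
/* Hypothetical per-node state maintained by the tiering code. */
static bool toptier_nodes[MAX_NUMNODES];

static bool node_is_toptier(int node)
{
	assert(node >= 0 && node < MAX_NUMNODES);
	return toptier_nodes[node];
}
#else
/* No migration, so no demotion: treat every node as top tier. */
static inline bool node_is_toptier(int node)
{
	(void)node;
	return true;
}
#endif
```

With a stub like this, none of the tier-building code needs to be
compiled when CONFIG_MIGRATION is disabled.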

 - Alistair

>>
>>> +		return;
>>> +
>>> +	disable_all_migrate_targets();
>>> +
>>> +	for_each_node_state(node, N_MEMORY) {
>>> +		best_distance = -1;
>>> +		nd = &node_demotion[node];
>>> +
>>> +		memtier = __node_get_memory_tier(node);
>>> +		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
>>> +			continue;
>>> +		/*
>>> +		 * Get the next memtier to find the  demotion node list.
>>> +		 */
>>> +		memtier = list_next_entry(memtier, list);
>>> +
>>> +		/*
>>> +		 * find_next_best_node, use 'used' nodemask as a skip list.
>>> +		 * Add all memory nodes except the selected memory tier
>>> +		 * nodelist to skip list so that we find the best node from the
>>> +		 * memtier nodelist.
>>> +		 */
>>> +		nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);
>>> +
>>> +		/*
>>> +		 * Find all the nodes in the memory tier node list of same best distance.
>>> +		 * add them to the preferred mask. We randomly select between nodes
>>> +		 * in the preferred mask when allocating pages during demotion.
>>> +		 */
>>> +		do {
>>> +			target = find_next_best_node(node, &used);
>>> +			if (target == NUMA_NO_NODE)
>>> +				break;
>>> +
>>> +			distance = node_distance(node, target);
>>> +			if (distance == best_distance || best_distance == -1) {
>>> +				best_distance = distance;
>>> +				node_set(target, nd->preferred);
>>> +			} else {
>>> +				break;
>>> +			}
>>> +		} while (1);
>>> +	}
>>> +}
>>> +
>
> .....
>
> -aneesh

* Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-18  5:28       ` Huang, Ying
@ 2022-07-18  5:58         ` Alistair Popple
  2022-07-18  6:56           ` Aneesh Kumar K V
  0 siblings, 1 reply; 29+ messages in thread
From: Alistair Popple @ 2022-07-18  5:58 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Wei Xu, Aneesh Kumar K.V, Linux MM, Andrew Morton, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Dan Williams, Johannes Weiner, jvgediya.oss,
	Jagdish Gediya


"Huang, Ying" <ying.huang@intel.com> writes:

> Wei Xu <weixugc@google.com> writes:
>
>> On Fri, Jul 15, 2022 at 12:53 AM Huang, Ying <ying.huang@intel.com> wrote:
>>>
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>
>>> > In the current kernel, memory tiers are defined implicitly via a
>>> > demotion path relationship between NUMA nodes, which is created
>>> > during the kernel initialization and updated when a NUMA node is
>>> > hot-added or hot-removed.  The current implementation puts all
>>> > nodes with CPU into the top tier, and builds the tier hierarchy
>>> > tier-by-tier by establishing the per-node demotion targets based
>>> > on the distances between nodes.
>>> >
>>> > This current memory tier kernel interface needs to be improved for
>>> > several important use cases,
>>> >
>>> > The current tier initialization code always initializes
>>> > each memory-only NUMA node into a lower tier.  But a memory-only
>>> > NUMA node may have a high performance memory device (e.g. a DRAM
>>> > device attached via CXL.mem or a DRAM-backed memory-only node on
>>> > a virtual machine) and should be put into a higher tier.
>>> >
>>> > The current tier hierarchy always puts CPU nodes into the top
>>> > tier. But on a system with HBM or GPU devices, the
>>> > memory-only NUMA nodes mapping these devices should be in the
>>> > top tier, and DRAM nodes with CPUs are better to be placed into the
>>> > next lower tier.
>>> >
>>> > With current kernel higher tier node can only be demoted to selected nodes on the
>>> > next lower tier as defined by the demotion path, not any other
>>> > node from any lower tier.  This strict, hard-coded demotion order
>>> > does not work in all use cases (e.g. some use cases may want to
>>> > allow cross-socket demotion to another node in the same demotion
>>> > tier as a fallback when the preferred demotion node is out of
>>> > space), This demotion order is also inconsistent with the page
>>> > allocation fallback order when all the nodes in a higher tier are
>>> > out of space: The page allocation can fall back to any node from
>>> > any lower tier, whereas the demotion order doesn't allow that.
>>> >
>>> > The current kernel also don't provide any interfaces for the
>>> > userspace to learn about the memory tier hierarchy in order to
>>> > optimize its memory allocations.
>>> >
>>> > This patch series address the above by defining memory tiers explicitly.
>>> >
>>> > This patch introduce explicity memory tiers. The tier ID value
>>> > of a memory tier is used to derive the demotion order between
>>> > NUMA nodes.
>>> >
>>> > For example, if we have 3 memtiers: memtier100, memtier200, memiter300
>>> > then the memory tier order is: memtier300 -> memtier200 -> memtier100
>>> > where memtier300 is the highest tier and memtier100 is the lowest tier.
>>> >
>>> > While reclaim we migrate pages from fast(higher) tiers to slow(lower)
>>> > tiers when the fast(higher) tier is under memory pressure.
>>> >
>>> > This patchset introduce 3 memory tiers (memtier100, memtier200 and memtier300)
>>> > which are created by different kernel subsystems. The default memory
>>> > tier created by the kernel is memtier200. A kernel parameter is provided
>>> > to override the default memory tier.
>>> >
>>> > Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>>> > Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>> >
>>> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> > ---
>>> >  include/linux/memory-tiers.h | 15 +++++++
>>> >  mm/Makefile                  |  1 +
>>> >  mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
>>> >  3 files changed, 94 insertions(+)
>>> >  create mode 100644 include/linux/memory-tiers.h
>>> >  create mode 100644 mm/memory-tiers.c
>>> >
>>> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>> > new file mode 100644
>>> > index 000000000000..a81dbc20e0d1
>>> > --- /dev/null
>>> > +++ b/include/linux/memory-tiers.h
>>> > @@ -0,0 +1,15 @@
>>> > +/* SPDX-License-Identifier: GPL-2.0 */
>>> > +#ifndef _LINUX_MEMORY_TIERS_H
>>> > +#define _LINUX_MEMORY_TIERS_H
>>> > +
>>> > +#ifdef CONFIG_NUMA
>>> > +
>>> > +#define MEMORY_TIER_HBM_GPU  300
>>> > +#define MEMORY_TIER_DRAM     200
>>> > +#define MEMORY_TIER_PMEM     100
>>> > +
>>> > +#define DEFAULT_MEMORY_TIER  MEMORY_TIER_DRAM
>>> > +#define MAX_MEMORY_TIER_ID   400
>>> > +
>>> > +#endif       /* CONFIG_NUMA */
>>> > +#endif  /* _LINUX_MEMORY_TIERS_H */
>>> > diff --git a/mm/Makefile b/mm/Makefile
>>> > index 6f9ffa968a1a..d30acebc2164 100644
>>> > --- a/mm/Makefile
>>> > +++ b/mm/Makefile
>>> > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>>> >  obj-$(CONFIG_FAILSLAB) += failslab.o
>>> >  obj-$(CONFIG_MEMTEST)                += memtest.o
>>> >  obj-$(CONFIG_MIGRATION) += migrate.o
>>> > +obj-$(CONFIG_NUMA) += memory-tiers.o
>>> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>> > new file mode 100644
>>> > index 000000000000..011877b6dbb9
>>> > --- /dev/null
>>> > +++ b/mm/memory-tiers.c
>>> > @@ -0,0 +1,78 @@
>>> > +// SPDX-License-Identifier: GPL-2.0
>>> > +#include <linux/types.h>
>>> > +#include <linux/nodemask.h>
>>> > +#include <linux/slab.h>
>>> > +#include <linux/lockdep.h>
>>> > +#include <linux/moduleparam.h>
>>> > +#include <linux/memory-tiers.h>
>>> > +
>>> > +struct memory_tier {
>>> > +     struct list_head list;
>>> > +     int id;
>>> > +     nodemask_t nodelist;
>>> > +};
>>> > +
>>> > +static DEFINE_MUTEX(memory_tier_lock);
>>> > +static LIST_HEAD(memory_tiers);
>>> > +
>>> > +static void insert_memory_tier(struct memory_tier *memtier)
>>> > +{
>>> > +     struct list_head *ent;
>>> > +     struct memory_tier *tmp_memtier;
>>> > +
>>> > +     lockdep_assert_held_once(&memory_tier_lock);
>>> > +
>>> > +     list_for_each(ent, &memory_tiers) {
>>> > +             tmp_memtier = list_entry(ent, struct memory_tier, list);
>>> > +             if (tmp_memtier->id < memtier->id) {
>>> > +                     list_add_tail(&memtier->list, ent);
>>> > +                     return;
>>> > +             }
>>> > +     }
>>> > +     list_add_tail(&memtier->list, &memory_tiers);
>>> > +}
>>> > +
>>> > +static struct memory_tier *register_memory_tier(unsigned int tier)
>>> > +{
>>> > +     struct memory_tier *memtier;
>>> > +
>>> > +     if (tier > MAX_MEMORY_TIER_ID)
>>> > +             return ERR_PTR(-EINVAL);
>>> > +
>>> > +     memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
>>> > +     if (!memtier)
>>> > +             return ERR_PTR(-ENOMEM);
>>> > +
>>> > +     memtier->id   = tier;
>>> > +
>>> > +     insert_memory_tier(memtier);
>>> > +
>>> > +     return memtier;
>>> > +}
>>> > +
>>> > +static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
>>> > +core_param(default_memory_tier, default_memtier, uint, 0644);
>>> > +
>>> > +static int __init memory_tier_init(void)
>>> > +{
>>> > +     struct memory_tier *memtier;
>>> > +
>>> > +     /*
>>> > +      * Register only default memory tier to hide all empty
>>> > +      * memory tier from sysfs. Since this is early during
>>> > +      * boot, we could avoid holding memtory_tier_lock. But
>>> > +      * keep it simple by holding locks. So we can add lock
>>> > +      * held debug checks in other functions.
>>> > +      */
>>> > +     mutex_lock(&memory_tier_lock);
>>> > +     memtier = register_memory_tier(default_memtier);
>>> > +     if (IS_ERR(memtier))
>>> > +             panic("%s() failed to register memory tier: %ld\n",
>>> > +                   __func__, PTR_ERR(memtier));
>>> > +
>>> > +     /* CPU only nodes are not part of memory tiers. */
>>> > +     memtier->nodelist = node_states[N_MEMORY];
>>> > +     mutex_unlock(&memory_tier_lock);
>>> > +     return 0;
>>> > +}
>>> > +subsys_initcall(memory_tier_init);
>>>
>>> You dropped the original sysfs interface patches from the series, but
>>> the kernel internal implementation is still for the original sysfs
>>> interface.  For example, memory tier ID is for the original sysfs
>>> interface, not for the new proposed sysfs interface.  So I suggest you
>>> to implement with the new interface in mind.  What do you think about
>>> the following design?
>>>
>>> - Each NUMA node belongs to a memory type, and each memory type
>>>   corresponds to a "abstract distance", so each NUMA node corresonds to
>>>   a "distance".  For simplicity, we can start with static distances, for
>>>   example, DRAM (default): 150, PMEM: 250.
>>
>> I agree with this design, though I'd prefer the new attribute to not
>> be named as "distance".  This is to both avoid the confusion with the
>> SLIT distance and to avoid the misconception that only the latency
>> matters, but the bandwidth doesn't.
>>
>> How about we call it "performance level" (perf_level) or something
>> similar instead?
>
> I have no strong opinion on this.  Both "distance" or "perf_level" looks
> OK to me.
>
>>> The distance of each NUMA
>>>   node can be recorded in a global array,
>>>
>>>     int node_distances[MAX_NUMNODES];
>>>
>>>   or, just
>>>
>>>     pgdat->distance
>>
>> I think node_devices[] is a better place to record this new attribute.
>> The HMAT performance data is also listed there.
>
> Firstly, we all agree that we need a place to record this information,
> per node or per memory type.  Personally, I prefer to separate the data
> and its interface (such as sysfs).
>
>>> - Each memory tier corresponds to a range of distance, for example,
>>>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>>>
>>> - The core API of memory tier could be
>>>
>>>     struct memory_tier *find_create_memory_tier(int distance);
>>>
>>>   it will find the memory tier which covers "distance" in the memory
>>>   tier list, or create a new memory tier if not found.
>>>
>>> - kmem_dax driver will setup distance for PMEM NUMA nodes before online
>>>   them.
>>
>> This attribute should be a property of the NUMA node based on the
>> device hardware.
>
> Yes.  Or a property of a memory type.
>
>> For PMEM, it is better to handle at the ACPI level.
>> For example, we can consider initializing this attribute for a PMEM
>> node in acpi_numa_memory_affinity_init() when the node is
>> non-volatile.
>
> The abstract_distance/perf_level may be determined from multiple
> information sources, e.g., ACPI SLIT/SRAT/HMAT, etc.  It should be the
> responsibility of device drivers (e.g., kmem_dax) to determine the final
> value of abstract_distance/perf_level based on the information
> availability/priority and some specific knowledge of the hardware.  Yes,
> ACPI SRAT is valuable to determine the abstract_distance/perf_level.
> And, it's better for kmem_dax to use it to determine the final value of
> abstract_distance/perf_level.
>
> To make the first version as simple as possible, I think we can just use
> some static abstract_distance/perf_level in kmem_dax driver for the NUMA
> nodes onlined by it.  Because we use the driver for PMEM only now.  We
> can enhance the implementation later.

I agree. Ideally I think all this should be derived from ACPI tables,
etc. However I think it will take a while for both FW and SW to make
that information available and correct. Letting drivers initialise that
for now at least should aid development in determining how performance
levels should be set from multiple information sources, especially if
there is no way of overriding it from userspace.

>>> - When a NUMA node is onlined, we will use find_create_memory_tier() to
>>>   find or create its memory tier and add the NUMA node into the memory
>>>   tier.
>>
>> I think we should create all the memory tiers up-front, just like NUMA
>> nodes, to keep their devices and IDs stable.  Similar to offline NUMA
>> nodes, when a memory tier has no online nodes, we can mark it as
>> offline and exclude it from online-related operations (e.g. demotion).
>> A memory tier can be made online when it gets assigned with an online
>> node.
>
> Each memory tier corresponds to a range of abstract_distance/perf_level.
> For example, if 1 <= abstract_distance/perf_level <= 500, 5 memory tiers
> can be defined with abstract_distance/perf_level ranges 1-100, 101-200,
> 201-300, 301-400, 401-500.  We can create these 5 memory tiers up-front
> of course.  In the new design, we may change the ranges at run time
> according to policy chosen by the users.  For example, we may change 5
> memory tiers above to 500 memory tiers, with
> abstract_distance/perf_level ranges 1-1, 2-2, ..., 500-500.  This may
> make memory tier devices and their IDs unstable at some degree.  But if
> we are cautious to customize the ranges, it's possible to make the
> memory tier devices and their IDs stable in most cases.
>
> Because we may define 500 memory tiers, it's hard to create all memory
> tiers up-front really.  But we can create them all in concept and
> allocate memory/resources for one when we add the first NUMA node to it.
>
> To make the fist version as simple as possible, I suggest to define 500
> memory tiers as above statically.
>
>>> - Or we can add memory type data structure now.
>
> Best Regards,
> Huang, Ying

* Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-15 10:27       ` Aneesh Kumar K.V
@ 2022-07-18  6:08         ` Huang, Ying
  0 siblings, 0 replies; 29+ messages in thread
From: Huang, Ying @ 2022-07-18  6:08 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>
> ....
>
>> 
>>> You dropped the original sysfs interface patches from the series, but
>>> the kernel internal implementation is still for the original sysfs
>>> interface.  For example, memory tier ID is for the original sysfs
>>> interface, not for the new proposed sysfs interface.  So I suggest you
>>> to implement with the new interface in mind.  What do you think about
>>> the following design?
>>> 
>>
>> Sorry I am not able to follow you here. This patchset completely drops
>> exposing memory tiers to userspace via sysfs. Instead it allow
>> creation of memory tiers with specific tierID from within the kernel/device driver.
>> Default tierID is 200 and dax kmem creates memory tier with tierID 100. 
>>
>>
>>> - Each NUMA node belongs to a memory type, and each memory type
>>>   corresponds to a "abstract distance", so each NUMA node corresonds to
>>>   a "distance".  For simplicity, we can start with static distances, for
>>>   example, DRAM (default): 150, PMEM: 250.  The distance of each NUMA
>>>   node can be recorded in a global array,
>>> 
>>>     int node_distances[MAX_NUMNODES];
>>> 
>>>   or, just
>>> 
>>>     pgdat->distance
>>> 
>>
>> I don't follow this. I guess you are trying to have a different design.
>> Would it be much easier if you can write this in the form of a patch? 
>>
>>
>>> - Each memory tier corresponds to a range of distance, for example,
>>>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>>> 
>>> - The core API of memory tier could be
>>> 
>>>     struct memory_tier *find_create_memory_tier(int distance);
>>> 
>>>   it will find the memory tier which covers "distance" in the memory
>>>   tier list, or create a new memory tier if not found.
>>> 
>>
>> I was expecting this to be internal to dax kmem. How dax kmem maps
>> "abstract distance" to a memory tier. At this point this patchset is
>> keeping all that for a future patchset. 
>>
>
> This shows how i was expecting "abstract distance" to be integrated.
>

Thanks!

To make the first version as simple as possible, I think we can just use
some static "abstract distance" for dax_kmem, e.g., 250, because we
use it for PMEM only now.  We can enhance dax_kmem later.

IMHO, we should get the core framework right first.

- A device driver should report the capability (or performance level) of
  the hardware to the memory tier core via abstract distance.  This can
  be done via some global data structure (e.g. node_distances[]) at
  least in the first version.

- Memory tier core determines the mapping from the abstract distance to
  the memory tier via abstract distance ranges, and allocates the struct
  memory_tier when necessary.  That is, memory tier core, not the
  device drivers, decides whether to allocate a new memory tier or
  reuse an existing one for NUMA nodes.

- It's better to place the NUMA node in the correct memory tier in the
  first place.  We should avoid placing the PMEM node in the default
  tier and then moving it to the correct memory tier.  That is, device
  drivers should report the abstract distance before onlining NUMA
  nodes.
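
The three points above can be sketched as a small userspace C model
(again, not kernel code).  node_distances[] is the global array
mentioned above; set_node_abstract_distance(), online_node(),
node_tier[] and the constants are hypothetical names and values chosen
only to show the ordering: the driver reports first, and the core picks
the tier when the node is onlined.

```c
#include <assert.h>

#define MAX_NUMNODES      8    /* hypothetical small value */
#define DISTANCE_DRAM   150    /* default, from the example distances */
#define DISTANCE_PMEM   250
#define TIER_RANGE_SIZE 100    /* ranges 1-100, 101-200, ... */

/* Abstract distance per node; 0 means "not reported, use default". */
static int node_distances[MAX_NUMNODES];
/* Tier index per node; -1 means the node is not online. */
static int node_tier[MAX_NUMNODES] = { -1, -1, -1, -1, -1, -1, -1, -1 };

/* Driver side: report the distance before the node is onlined. */
static void set_node_abstract_distance(int node, int distance)
{
	assert(node >= 0 && node < MAX_NUMNODES);
	assert(node_tier[node] == -1);  /* too late once online */
	node_distances[node] = distance;
}

/* Core side: the core, not the driver, chooses the tier at online. */
static int online_node(int node)
{
	int distance;

	assert(node >= 0 && node < MAX_NUMNODES);
	distance = node_distances[node] ? node_distances[node]
					: DISTANCE_DRAM;
	node_tier[node] = (distance - 1) / TIER_RANGE_SIZE;
	return node_tier[node];
}
```

In this model a PMEM node whose driver called
set_node_abstract_distance() before online_node() goes straight into
its own range and never passes through the default tier.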

Please check my reply to Wei too about my other suggestions for the
first version.

Best Regards,
Huang, Ying

* Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-18  5:58         ` Alistair Popple
@ 2022-07-18  6:56           ` Aneesh Kumar K V
  0 siblings, 0 replies; 29+ messages in thread
From: Aneesh Kumar K V @ 2022-07-18  6:56 UTC (permalink / raw)
  To: Alistair Popple, Huang, Ying
  Cc: Wei Xu, Linux MM, Andrew Morton, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

On 7/18/22 11:28 AM, Alistair Popple wrote:
> 
> "Huang, Ying" <ying.huang@intel.com> writes:
> 
>> Wei Xu <weixugc@google.com> writes:
>>
>>> On Fri, Jul 15, 2022 at 12:53 AM Huang, Ying <ying.huang@intel.com> wrote:
>>>>
>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>
>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>> demotion path relationship between NUMA nodes, which is created
>>>>> during the kernel initialization and updated when a NUMA node is
>>>>> hot-added or hot-removed.  The current implementation puts all
>>>>> nodes with CPU into the top tier, and builds the tier hierarchy
>>>>> tier-by-tier by establishing the per-node demotion targets based
>>>>> on the distances between nodes.
>>>>>
>>>>> This current memory tier kernel interface needs to be improved for
>>>>> several important use cases,
>>>>>
>>>>> The current tier initialization code always initializes
>>>>> each memory-only NUMA node into a lower tier.  But a memory-only
>>>>> NUMA node may have a high performance memory device (e.g. a DRAM
>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>> a virtual machine) and should be put into a higher tier.
>>>>>
>>>>> The current tier hierarchy always puts CPU nodes into the top
>>>>> tier. But on a system with HBM or GPU devices, the
>>>>> memory-only NUMA nodes mapping these devices should be in the
>>>>> top tier, and DRAM nodes with CPUs are better to be placed into the
>>>>> next lower tier.
>>>>>
>>>>> With current kernel higher tier node can only be demoted to selected nodes on the
>>>>> next lower tier as defined by the demotion path, not any other
>>>>> node from any lower tier.  This strict, hard-coded demotion order
>>>>> does not work in all use cases (e.g. some use cases may want to
>>>>> allow cross-socket demotion to another node in the same demotion
>>>>> tier as a fallback when the preferred demotion node is out of
>>>>> space), This demotion order is also inconsistent with the page
>>>>> allocation fallback order when all the nodes in a higher tier are
>>>>> out of space: The page allocation can fall back to any node from
>>>>> any lower tier, whereas the demotion order doesn't allow that.
>>>>>
>>>>> The current kernel also don't provide any interfaces for the
>>>>> userspace to learn about the memory tier hierarchy in order to
>>>>> optimize its memory allocations.
>>>>>
>>>>> This patch series address the above by defining memory tiers explicitly.
>>>>>
>>>>> This patch introduce explicity memory tiers. The tier ID value
>>>>> of a memory tier is used to derive the demotion order between
>>>>> NUMA nodes.
>>>>>
>>>>> For example, if we have 3 memtiers: memtier100, memtier200, memiter300
>>>>> then the memory tier order is: memtier300 -> memtier200 -> memtier100
>>>>> where memtier300 is the highest tier and memtier100 is the lowest tier.
>>>>>
>>>>> While reclaim we migrate pages from fast(higher) tiers to slow(lower)
>>>>> tiers when the fast(higher) tier is under memory pressure.
>>>>>
>>>>> This patchset introduce 3 memory tiers (memtier100, memtier200 and memtier300)
>>>>> which are created by different kernel subsystems. The default memory
>>>>> tier created by the kernel is memtier200. A kernel parameter is provided
>>>>> to override the default memory tier.
>>>>>
>>>>> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>>>>> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>>>
>>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>> ---
>>>>>  include/linux/memory-tiers.h | 15 +++++++
>>>>>  mm/Makefile                  |  1 +
>>>>>  mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
>>>>>  3 files changed, 94 insertions(+)
>>>>>  create mode 100644 include/linux/memory-tiers.h
>>>>>  create mode 100644 mm/memory-tiers.c
>>>>>
>>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>>>> new file mode 100644
>>>>> index 000000000000..a81dbc20e0d1
>>>>> --- /dev/null
>>>>> +++ b/include/linux/memory-tiers.h
>>>>> @@ -0,0 +1,15 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>> +#ifndef _LINUX_MEMORY_TIERS_H
>>>>> +#define _LINUX_MEMORY_TIERS_H
>>>>> +
>>>>> +#ifdef CONFIG_NUMA
>>>>> +
>>>>> +#define MEMORY_TIER_HBM_GPU  300
>>>>> +#define MEMORY_TIER_DRAM     200
>>>>> +#define MEMORY_TIER_PMEM     100
>>>>> +
>>>>> +#define DEFAULT_MEMORY_TIER  MEMORY_TIER_DRAM
>>>>> +#define MAX_MEMORY_TIER_ID   400
>>>>> +
>>>>> +#endif       /* CONFIG_NUMA */
>>>>> +#endif  /* _LINUX_MEMORY_TIERS_H */
>>>>> diff --git a/mm/Makefile b/mm/Makefile
>>>>> index 6f9ffa968a1a..d30acebc2164 100644
>>>>> --- a/mm/Makefile
>>>>> +++ b/mm/Makefile
>>>>> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>>>>>  obj-$(CONFIG_FAILSLAB) += failslab.o
>>>>>  obj-$(CONFIG_MEMTEST)                += memtest.o
>>>>>  obj-$(CONFIG_MIGRATION) += migrate.o
>>>>> +obj-$(CONFIG_NUMA) += memory-tiers.o
>>>>>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>>>>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>>>>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>>>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>>>> new file mode 100644
>>>>> index 000000000000..011877b6dbb9
>>>>> --- /dev/null
>>>>> +++ b/mm/memory-tiers.c
>>>>> @@ -0,0 +1,78 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>> +#include <linux/types.h>
>>>>> +#include <linux/nodemask.h>
>>>>> +#include <linux/slab.h>
>>>>> +#include <linux/lockdep.h>
>>>>> +#include <linux/moduleparam.h>
>>>>> +#include <linux/memory-tiers.h>
>>>>> +
>>>>> +struct memory_tier {
>>>>> +     struct list_head list;
>>>>> +     int id;
>>>>> +     nodemask_t nodelist;
>>>>> +};
>>>>> +
>>>>> +static DEFINE_MUTEX(memory_tier_lock);
>>>>> +static LIST_HEAD(memory_tiers);
>>>>> +
>>>>> +static void insert_memory_tier(struct memory_tier *memtier)
>>>>> +{
>>>>> +     struct list_head *ent;
>>>>> +     struct memory_tier *tmp_memtier;
>>>>> +
>>>>> +     lockdep_assert_held_once(&memory_tier_lock);
>>>>> +
>>>>> +     list_for_each(ent, &memory_tiers) {
>>>>> +             tmp_memtier = list_entry(ent, struct memory_tier, list);
>>>>> +             if (tmp_memtier->id < memtier->id) {
>>>>> +                     list_add_tail(&memtier->list, ent);
>>>>> +                     return;
>>>>> +             }
>>>>> +     }
>>>>> +     list_add_tail(&memtier->list, &memory_tiers);
>>>>> +}
>>>>> +
>>>>> +static struct memory_tier *register_memory_tier(unsigned int tier)
>>>>> +{
>>>>> +     struct memory_tier *memtier;
>>>>> +
>>>>> +     if (tier > MAX_MEMORY_TIER_ID)
>>>>> +             return ERR_PTR(-EINVAL);
>>>>> +
>>>>> +     memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
>>>>> +     if (!memtier)
>>>>> +             return ERR_PTR(-ENOMEM);
>>>>> +
>>>>> +     memtier->id   = tier;
>>>>> +
>>>>> +     insert_memory_tier(memtier);
>>>>> +
>>>>> +     return memtier;
>>>>> +}
>>>>> +
>>>>> +static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
>>>>> +core_param(default_memory_tier, default_memtier, uint, 0644);
>>>>> +
>>>>> +static int __init memory_tier_init(void)
>>>>> +{
>>>>> +     struct memory_tier *memtier;
>>>>> +
>>>>> +     /*
>>>>> +      * Register only default memory tier to hide all empty
>>>>> +      * memory tier from sysfs. Since this is early during
>>>>> +      * boot, we could avoid holding memory_tier_lock. But
>>>>> +      * keep it simple by holding locks. So we can add lock
>>>>> +      * held debug checks in other functions.
>>>>> +      */
>>>>> +     mutex_lock(&memory_tier_lock);
>>>>> +     memtier = register_memory_tier(default_memtier);
>>>>> +     if (IS_ERR(memtier))
>>>>> +             panic("%s() failed to register memory tier: %ld\n",
>>>>> +                   __func__, PTR_ERR(memtier));
>>>>> +
>>>>> +     /* CPU only nodes are not part of memory tiers. */
>>>>> +     memtier->nodelist = node_states[N_MEMORY];
>>>>> +     mutex_unlock(&memory_tier_lock);
>>>>> +     return 0;
>>>>> +}
>>>>> +subsys_initcall(memory_tier_init);
>>>>
>>>> You dropped the original sysfs interface patches from the series, but
>>>> the kernel internal implementation is still for the original sysfs
>>>> interface.  For example, memory tier ID is for the original sysfs
>>>> interface, not for the new proposed sysfs interface.  So I suggest
>>>> implementing with the new interface in mind.  What do you think about
>>>> the following design?
>>>>
>>>> - Each NUMA node belongs to a memory type, and each memory type
>>>>   corresponds to an "abstract distance", so each NUMA node corresponds to
>>>>   a "distance".  For simplicity, we can start with static distances, for
>>>>   example, DRAM (default): 150, PMEM: 250.
>>>
>>> I agree with this design, though I'd prefer the new attribute to not
>>> be named as "distance".  This is to both avoid the confusion with the
>>> SLIT distance and to avoid the misconception that only the latency
>>> matters, but the bandwidth doesn't.
>>>
>>> How about we call it "performance level" (perf_level) or something
>>> similar instead?
>>
>> I have no strong opinion on this.  Both "distance" and "perf_level" look
>> OK to me.
>>
>>>> The distance of each NUMA
>>>>   node can be recorded in a global array,
>>>>
>>>>     int node_distances[MAX_NUMNODES];
>>>>
>>>>   or, just
>>>>
>>>>     pgdat->distance
>>>
>>> I think node_devices[] is a better place to record this new attribute.
>>> The HMAT performance data is also listed there.
>>
>> Firstly, we all agree that we need a place to record this information,
>> per node or per memory type.  Personally, I prefer to separate the data
>> and its interface (such as sysfs).
>>
>>>> - Each memory tier corresponds to a range of distance, for example,
>>>>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>>>>
>>>> - The core API of memory tier could be
>>>>
>>>>     struct memory_tier *find_create_memory_tier(int distance);
>>>>
>>>>   it will find the memory tier which covers "distance" in the memory
>>>>   tier list, or create a new memory tier if not found.
>>>>
>>>> - kmem_dax driver will set up the distance for PMEM NUMA nodes before
>>>>   onlining them.
>>>
>>> This attribute should be a property of the NUMA node based on the
>>> device hardware.
>>
>> Yes.  Or a property of a memory type.
>>
>>> For PMEM, it is better to handle at the ACPI level.
>>> For example, we can consider initializing this attribute for a PMEM
>>> node in acpi_numa_memory_affinity_init() when the node is
>>> non-volatile.
>>
>> The abstract_distance/perf_level may be determined from multiple
>> information sources, e.g., ACPI SLIT/SRAT/HMAT, etc.  It should be the
>> responsibility of device drivers (e.g., kmem_dax) to determine the final
>> value of abstract_distance/perf_level based on the information
>> availability/priority and some specific knowledge of the hardware.  Yes,
>> ACPI SRAT is valuable to determine the abstract_distance/perf_level.
>> And, it's better for kmem_dax to use it to determine the final value of
>> abstract_distance/perf_level.
>>
>> To make the first version as simple as possible, I think we can just use
>> some static abstract_distance/perf_level in kmem_dax driver for the NUMA
>> nodes onlined by it.  Because we use the driver for PMEM only now.  We
>> can enhance the implementation later.
> 
> I agree. Ideally I think all this should be derived from ACPI tables,
> etc. However I think it will take a while for both FW and SW to make
> that information available and correct. Letting drivers initialise that
> for now at least should aid development in determining how performance
> levels should be set from multiple information sources, especially if
> there is no way of overriding it from userspace.
>

When we parse the firmware tables, node_devices is mostly not allocated yet.
It gets allocated in register_one_node().  We can use a hotplug
callback like the one below.  This should also allow us to update perf_level
based on ACPI tables.

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index ae5f4acf2675..89b010e0461e 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -15,6 +15,7 @@
 #include <linux/sort.h>
 #include <linux/io.h>
 #include <linux/nd.h>
+#include <linux/memory.h>
 #include <asm/cacheflush.h>
 #include <acpi/nfit.h>
 #include "intel.h"
@@ -3470,6 +3471,45 @@ static struct acpi_driver acpi_nfit_driver = {
 	},
 };
 
+static int nfit_callback(struct notifier_block *self,
+			 unsigned long action, void *arg)
+{
+	bool found = false;
+	struct memory_notify *mnb = arg;
+	int nid = mnb->status_change_nid;
+	struct nfit_spa *nfit_spa;
+	struct acpi_nfit_desc *acpi_desc;
+
+	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
+		return NOTIFY_OK;
+
+	mutex_lock(&acpi_desc_lock);
+	list_for_each_entry(acpi_desc, &acpi_descs, list) {
+		mutex_lock(&acpi_desc->init_mutex);
+		list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
+			struct acpi_nfit_system_address *spa = nfit_spa->spa;
+			int target_node = pxm_to_node(spa->proximity_domain);
+
+			if (target_node == nid) {
+				node_devices[nid]->perf_level = 1;
+				found = true;
+				break;
+			}
+		}
+		mutex_unlock(&acpi_desc->init_mutex);
+		if (found)
+			break;
+	}
+	mutex_unlock(&acpi_desc_lock);
+	return NOTIFY_OK;
+}
+
+static struct notifier_block nfit_callback_nb = {
+	.notifier_call = nfit_callback,
+	.priority = 2,
+};
+
+
 static __init int nfit_init(void)
 {
 	int ret;
@@ -3509,7 +3549,11 @@ static __init int nfit_init(void)
 		nfit_mce_unregister();
 		destroy_workqueue(nfit_wq);
 	}
-
+	/*
+	 * register a memory hotplug notifier at prio 2 so that we
+	 * can update the perf level for the node.
+	 */
+	register_hotmemory_notifier(&nfit_callback_nb);
 	return ret;
 
 }
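
A note on the priority in the notifier block above: notifier chains
invoke callbacks in descending priority order, so a priority-2 callback
runs before a lower-priority one on the same chain.  The following is a
minimal user-space sketch of that ordering, not kernel code.  All names
in it (struct nb, chain_register(), set_perf_cb(), tier_cb(),
online_node()) are hypothetical stand-ins, and it assumes that the
memory-tier callback consuming perf_level is registered at a lower
priority than nfit_callback().

```c
#include <assert.h>

#define NR_NODES 4

/* Hypothetical user-space stand-in for struct notifier_block. */
struct nb {
	int (*call)(struct nb *self, unsigned long action, void *arg);
	int priority;
	struct nb *next;
};

static struct nb *chain;	/* kept sorted, highest priority first */

/* Insert so that higher-priority blocks are called earlier. */
static void chain_register(struct nb *n)
{
	struct nb **pp = &chain;

	while (*pp && (*pp)->priority >= n->priority)
		pp = &(*pp)->next;
	n->next = *pp;
	*pp = n;
}

/* Walk the chain in order and invoke every callback. */
static void chain_call(unsigned long action, void *arg)
{
	struct nb *n;

	for (n = chain; n; n = n->next)
		n->call(n, action, arg);
}

static int perf_level[NR_NODES];	/* per-node perf level, 0 = unset */
static int tier_input[NR_NODES];	/* value the tier callback observed */

/* Plays the role of nfit_callback(): records the node's perf level. */
static int set_perf_cb(struct nb *self, unsigned long action, void *arg)
{
	(void)self;
	(void)action;
	perf_level[*(int *)arg] = 1;
	return 0;
}

/* Plays the role of a memory-tier callback that consumes the value. */
static int tier_cb(struct nb *self, unsigned long action, void *arg)
{
	int nid = *(int *)arg;

	(void)self;
	(void)action;
	tier_input[nid] = perf_level[nid];
	return 0;
}

static struct nb tier_nb = { .call = tier_cb, .priority = 1 };
static struct nb perf_nb = { .call = set_perf_cb, .priority = 2 };

/* Online one node; returns the perf level the tier callback saw. */
static int online_node(int nid)
{
	assert(nid >= 0 && nid < NR_NODES);
	chain_call(1 /* stand-in for MEM_ONLINE */, &nid);
	return tier_input[nid];
}
```

Even if tier_nb is registered before perf_nb, set_perf_cb() runs first
because of its higher priority, so the tier callback sees the updated
perf level.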



 


* Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-15  9:08     ` Aneesh Kumar K V
  2022-07-15  9:24       ` Aneesh Kumar K V
  2022-07-15 10:27       ` Aneesh Kumar K.V
@ 2022-07-18  6:57       ` Huang, Ying
  2022-07-18  8:00         ` Aneesh Kumar K V
  2 siblings, 1 reply; 29+ messages in thread
From: Huang, Ying @ 2022-07-18  6:57 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 7/15/22 1:23 PM, Huang, Ying wrote:

[snip]

>> 
>> You dropped the original sysfs interface patches from the series, but
>> the kernel internal implementation is still for the original sysfs
>> interface.  For example, memory tier ID is for the original sysfs
>> interface, not for the new proposed sysfs interface.  So I suggest
>> implementing with the new interface in mind.  What do you think about
>> the following design?
>> 
>
> Sorry I am not able to follow you here. This patchset completely drops
> exposing memory tiers to userspace via sysfs. Instead it allows
> creation of memory tiers with specific tierID from within the kernel/device driver.
> Default tierID is 200 and dax kmem creates memory tier with tierID 100. 
>
>
>> - Each NUMA node belongs to a memory type, and each memory type
>>   corresponds to an "abstract distance", so each NUMA node corresponds to
>>   a "distance".  For simplicity, we can start with static distances, for
>>   example, DRAM (default): 150, PMEM: 250.  The distance of each NUMA
>>   node can be recorded in a global array,
>> 
>>     int node_distances[MAX_NUMNODES];
>> 
>>   or, just
>> 
>>     pgdat->distance
>> 
>
> I don't follow this. I guess you are trying to have a different design.
> Would it be much easier if you can write this in the form of a patch? 

I have written some pseudo code as follows to show my basic idea.

#define MEMORY_TIER_ADISTANCE_DRAM	150
#define MEMORY_TIER_ADISTANCE_PMEM	250

struct memory_tier {
	/* abstract distance range covered by the memory tier */
	int adistance_start;
	int adistance_len;
	struct list_head list;
	nodemask_t nodemask;
};

/* RCU list of memory tiers */
static LIST_HEAD(memory_tiers);

/* abstract distance of each NUMA node */
int node_adistances[MAX_NUMNODES];

struct memory_tier *find_create_memory_tier(int adistance)
{
	struct memory_tier *tier;

	list_for_each_entry(tier, &memory_tiers, list) {
		if (adistance >= tier->adistance_start &&
		    adistance < tier->adistance_start + tier->adistance_len)
			return tier;
	}
	/* allocate a new memory tier and return */
}

void memory_tier_add_node(int nid)
{
	int adistance;
	struct memory_tier *tier;

	/* fall back to the DRAM default when no driver has set a value */
	adistance = node_adistances[nid] ?: MEMORY_TIER_ADISTANCE_DRAM;
	tier = find_create_memory_tier(adistance);
	node_set(nid, &tier->nodemask);
	/* setup demotion data structure, etc */
}

static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
						 unsigned long action, void *_arg)
{
	struct memory_notify *arg = _arg;
	int nid;

	nid = arg->status_change_nid;
	if (nid < 0)
		return notifier_from_errno(0);

	switch (action) {
	case MEM_ONLINE:
		memory_tier_add_node(nid);
		break;
	}

	return notifier_from_errno(0);
}

/* kmem.c */
static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
	node_adistances[dev_dax->target_node] = MEMORY_TIER_ADISTANCE_PMEM;
	/* add_memory_driver_managed() */
}
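
The lookup and the per-node default above can be tried out in user
space.  The following is only a sketch under stated assumptions, not the
proposed kernel implementation: it assumes a static range width of 100
per tier (TIER_ADISTANCE_LEN, a made-up name), non-negative distances,
a plain singly linked list instead of list_head, and an unsigned long
bitmask instead of nodemask_t.  It also fills in the "allocate a new
memory tier" step in one possible way.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_NODES		8	/* stand-in for MAX_NUMNODES */
#define ADISTANCE_DRAM		150
#define ADISTANCE_PMEM		250
#define TIER_ADISTANCE_LEN	100	/* assumed static range width */

struct memory_tier {
	int adistance_start;
	int adistance_len;
	unsigned long nodemask;		/* one bit per node */
	struct memory_tier *next;
};

static struct memory_tier *memory_tiers;
static int node_adistances[MAX_NODES];	/* 0 means "not set by a driver" */

/* Return the tier whose range covers adistance, creating it if needed. */
static struct memory_tier *find_create_memory_tier(int adistance)
{
	struct memory_tier *tier;

	for (tier = memory_tiers; tier; tier = tier->next) {
		if (adistance >= tier->adistance_start &&
		    adistance < tier->adistance_start + tier->adistance_len)
			return tier;
	}

	tier = calloc(1, sizeof(*tier));
	if (!tier)
		return NULL;
	/* Start of the fixed-width range that contains adistance. */
	tier->adistance_start = adistance - adistance % TIER_ADISTANCE_LEN;
	tier->adistance_len = TIER_ADISTANCE_LEN;
	tier->next = memory_tiers;
	memory_tiers = tier;
	return tier;
}

/* Put a node into the tier that matches its abstract distance. */
static struct memory_tier *memory_tier_add_node(int nid)
{
	struct memory_tier *tier;
	int adistance;

	assert(nid >= 0 && nid < MAX_NODES);
	/* Fall back to the DRAM default when no driver set a value. */
	adistance = node_adistances[nid] ? node_adistances[nid]
					 : ADISTANCE_DRAM;
	tier = find_create_memory_tier(adistance);
	if (tier)
		tier->nodemask |= 1UL << nid;
	return tier;
}
```

With these assumptions, a node without a driver-provided distance lands
in the tier covering 100-199, and a node marked with ADISTANCE_PMEM
lands in a separate tier covering 200-299.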

[snip]

Best Regards,
Huang, Ying


* Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-18  6:57       ` Huang, Ying
@ 2022-07-18  8:00         ` Aneesh Kumar K V
  2022-07-18  8:55           ` Huang, Ying
  0 siblings, 1 reply; 29+ messages in thread
From: Aneesh Kumar K V @ 2022-07-18  8:00 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

On 7/18/22 12:27 PM, Huang, Ying wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> 
>> On 7/15/22 1:23 PM, Huang, Ying wrote:
> 
> [snip]


Implementing that, I ended up with the code below. The difference is that
adistance_len is not a memory tier property; instead it is a kernel parameter,
like memory_tier_chunk_size, which can be tuned to create more memory tiers.
How about this? Not yet tested.

struct memory_tier {
	struct list_head list;
	int id;
	int perf_level;
	nodemask_t nodelist;
};

static LIST_HEAD(memory_tiers);
static DEFINE_MUTEX(memory_tier_lock);
static unsigned int default_memtier_perf_level = DEFAULT_MEMORY_TYPE_PERF;
core_param(default_memory_tier_perf_level, default_memtier_perf_level, uint, 0644);
static unsigned int memtier_perf_chunk_size = 150;
core_param(memory_tier_perf_chunk, memtier_perf_chunk_size, uint, 0644);

/*
 * Performance levels are grouped into memtiers, each covering a chunk of
 * memtier_perf_chunk_size perf levels.
 */
static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
{
	bool found_slot = false;
	struct list_head *ent;
	struct memory_tier *memtier, *new_memtier;
	static int next_memtier_id = 0;
	/*
	 * zero is special in that it indicates uninitialized
	 * perf level by respective driver. Pick default memory
	 * tier perf level for that.
	 */
	if (!perf_level)
		perf_level = default_memtier_perf_level;

	lockdep_assert_held_once(&memory_tier_lock);

	list_for_each(ent, &memory_tiers) {
		memtier = list_entry(ent, struct memory_tier, list);
		if (perf_level >= memtier->perf_level &&
		    perf_level < memtier->perf_level + memtier_perf_chunk_size)
			return memtier;
		else if (perf_level < memtier->perf_level) {
			found_slot = true;
			break;
		}
	}

	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
	if (!new_memtier)
		return ERR_PTR(-ENOMEM);

	new_memtier->id = next_memtier_id++;
	/* chunk size need not be a power of two, so ALIGN_DOWN() does not apply */
	new_memtier->perf_level = rounddown(perf_level, memtier_perf_chunk_size);
	if (found_slot)
		list_add_tail(&new_memtier->list, ent);
	else
		list_add_tail(&new_memtier->list, &memory_tiers);
	return new_memtier;
}

static int __init memory_tier_init(void)
{
	int node;
	struct memory_tier *memtier;

	/*
	 * Since this is early during boot, we could avoid
	 * holding memory_tier_lock. But keep it simple by
	 * holding the lock, so that we can add lock-held debug
	 * checks in other functions.
	 */
	mutex_lock(&memory_tier_lock);
	memtier = find_create_memory_tier(default_memtier_perf_level);
	if (IS_ERR(memtier))
		panic("%s() failed to register memory tier: %ld\n",
		      __func__, PTR_ERR(memtier));

	/* CPU only nodes are not part of memory tiers. */
	memtier->nodelist = node_states[N_MEMORY];

	/*
	 * Nodes that are already online and do not have a
	 * perf level assigned get the default perf
	 * level.
	 */
	for_each_node_state(node, N_MEMORY) {
		struct node *node_property = node_devices[node];

		if (!node_property->perf_level)
			node_property->perf_level = default_memtier_perf_level;
	}
	mutex_unlock(&memory_tier_lock);
	return 0;
}
subsys_initcall(memory_tier_init);
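
One caveat on rounding a perf level down to the start of its chunk: the
kernel's ALIGN_DOWN() is mask-based and expects a power-of-two
alignment, while the chunk size here defaults to 150.  The user-space
sketch below compares a mask-based round-down with a modulo-based one.
The helper names (mask_align_down(), round_down_any(), chunk_start())
are made up for illustration, and the value 200 is only an example perf
level.

```c
#include <assert.h>

/* Mask-based round-down: only valid when align is a power of two. */
static unsigned int mask_align_down(unsigned int x, unsigned int align)
{
	return x & ~(align - 1);
}

/* Modulo-based round-down: valid for any non-zero step. */
static unsigned int round_down_any(unsigned int x, unsigned int step)
{
	assert(step != 0);
	return x - x % step;
}

/* Start of the chunk (tier) that a perf level falls into. */
static unsigned int chunk_start(unsigned int perf_level, unsigned int chunk)
{
	return round_down_any(perf_level, chunk);
}
```

For a power-of-two chunk such as 128 both helpers agree, but for a
chunk of 150 the mask-based variant maps 200 to 72 instead of 150.  In
the kernel, the modulo-based rounddown() helper would be the closer
match for an arbitrary chunk size.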



* Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
  2022-07-18  8:00         ` Aneesh Kumar K V
@ 2022-07-18  8:55           ` Huang, Ying
  0 siblings, 0 replies; 29+ messages in thread
From: Huang, Ying @ 2022-07-18  8:55 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss, Jagdish Gediya

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 7/18/22 12:27 PM, Huang, Ying wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> 
>>> On 7/15/22 1:23 PM, Huang, Ying wrote:
>> 
>> [snip]
>
>
> Implementing that I ended up with the below. The difference is adistance_len is not a memory tier property
> instead it is a kernel parameter like memory_tier_chunk_size which can
> be tuned to create more memory tiers.

How to represent the abstract distance range of a memory tier has not
been decided yet.  perf_level_chunk_size or perf_level_granularity is
another possible solution.  But I don't think it should be a kernel
parameter for the first step.

> How about this? Not yet tested.
>
> struct memory_tier {
> 	struct list_head list;
> 	int id;

In fact, we don't need "id" for now, so I suggest removing it.  We can
add it when we really need it.

> 	int perf_level;
> 	nodemask_t nodelist;
> };
>
> static LIST_HEAD(memory_tiers);
> static DEFINE_MUTEX(memory_tier_lock);
> static unsigned int default_memtier_perf_level = DEFAULT_MEMORY_TYPE_PERF;
> core_param(default_memory_tier_perf_level, default_memtier_perf_level, uint, 0644);
> static unsigned int memtier_perf_chunk_size = 150;
> core_param(memory_tier_perf_chunk, memtier_perf_chunk_size, uint, 0644);
>
> /*
>  * performance levels are grouped into memtiers each of chunk size
>  * memtier_perf_chunk
>  */
> static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
> {
> 	bool found_slot = false;
> 	struct list_head *ent;
> 	struct memory_tier *memtier, *new_memtier;
> 	static int next_memtier_id = 0;
> 	/*
> 	 * zero is special in that it indicates uninitialized
> 	 * perf level by respective driver. Pick default memory
> 	 * tier perf level for that.
> 	 */
> 	if (!perf_level)
> 		perf_level = default_memtier_perf_level;
>
> 	lockdep_assert_held_once(&memory_tier_lock);
>
> 	list_for_each(ent, &memory_tiers) {
> 		memtier = list_entry(ent, struct memory_tier, list);
> 		if (perf_level >= memtier->perf_level &&
> 		    perf_level < memtier->perf_level + memtier_perf_chunk_size)
> 			return memtier;
> 		else if (perf_level < memtier->perf_level) {
> 			found_slot = true;
> 			break;
> 		}
> 	}
>
> 	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> 	if (!new_memtier)
> 		return ERR_PTR(-ENOMEM);
>
> 	new_memtier->id = next_memtier_id++;
> 	/* chunk size need not be a power of two, so ALIGN_DOWN() does not apply */
> 	new_memtier->perf_level = rounddown(perf_level, memtier_perf_chunk_size);
> 	if (found_slot)
> 		list_add_tail(&new_memtier->list, ent);
> 	else
> 		list_add_tail(&new_memtier->list, &memory_tiers);
> 	return new_memtier;
> }
>
> static int __init memory_tier_init(void)
> {
> 	int node;
> 	struct memory_tier *memtier;
>
> 	/*
> 	 * Since this is early during boot, we could avoid
> 	 * holding memory_tier_lock. But keep it simple by
> 	 * holding the lock, so that we can add lock-held debug
> 	 * checks in other functions.
> 	 */
> 	mutex_lock(&memory_tier_lock);
> 	memtier = find_create_memory_tier(default_memtier_perf_level);
> 	if (IS_ERR(memtier))
> 		panic("%s() failed to register memory tier: %ld\n",
> 		      __func__, PTR_ERR(memtier));
>
> 	/* CPU only nodes are not part of memory tiers. */
> 	memtier->nodelist = node_states[N_MEMORY];
>
> 	/*
> 	 * Nodes that are already online and do not have a
> 	 * perf level assigned get the default perf
> 	 * level.
> 	 */
> 	for_each_node_state(node, N_MEMORY) {
> 		struct node *node_property = node_devices[node];
>
> 		if (!node_property->perf_level)
> 			node_property->perf_level = default_memtier_perf_level;
> 	}
> 	mutex_unlock(&memory_tier_lock);
> 	return 0;
> }
> subsys_initcall(memory_tier_init);

I think that this can be a starting point for our future discussion and
review.  Thanks!

Best Regards,
Huang, Ying


end of thread, other threads:[~2022-07-18  8:55 UTC | newest]

Thread overview: 29+ messages
2022-07-14  4:53 [PATCH v9 0/8] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
2022-07-14  4:53 ` [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
2022-07-15  7:53   ` Huang, Ying
2022-07-15  9:08     ` Aneesh Kumar K V
2022-07-15  9:24       ` Aneesh Kumar K V
2022-07-15 10:27       ` Aneesh Kumar K.V
2022-07-18  6:08         ` Huang, Ying
2022-07-18  6:57       ` Huang, Ying
2022-07-18  8:00         ` Aneesh Kumar K V
2022-07-18  8:55           ` Huang, Ying
2022-07-15 16:59     ` Wei Xu
2022-07-18  5:28       ` Huang, Ying
2022-07-18  5:58         ` Alistair Popple
2022-07-18  6:56           ` Aneesh Kumar K V
2022-07-14  4:53 ` [PATCH v9 2/8] mm/demotion: Move memory demotion related code Aneesh Kumar K.V
2022-07-14  4:53 ` [PATCH v9 3/8] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
2022-07-14  4:53 ` [PATCH v9 4/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined Aneesh Kumar K.V
2022-07-15  4:38   ` Alistair Popple
2022-07-15  7:23     ` Aneesh Kumar K.V
2022-07-14  4:53 ` [PATCH v9 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
2022-07-15  4:47   ` Alistair Popple
2022-07-15  7:21     ` Aneesh Kumar K.V
2022-07-18  5:41       ` Alistair Popple
2022-07-14  4:53 ` [PATCH v9 6/8] mm/demotion: Add pg_data_t member to track node memory tier details Aneesh Kumar K.V
2022-07-15  5:49   ` Alistair Popple
2022-07-15  7:19     ` Aneesh Kumar K.V
2022-07-18  5:22       ` Alistair Popple
2022-07-14  4:53 ` [PATCH v9 7/8] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
2022-07-14  4:53 ` [PATCH v9 8/8] mm/demotion: Update node_is_toptier to work with memory tiers Aneesh Kumar K.V
