* [PATCH RFC lsfmm 0/6] mm: proactive reclaim and memory tiering topics
@ 2022-04-16  5:38 Davidlohr Bueso
  2022-04-16  5:38 ` [PATCH 1/6] drivers/base/node: cleanup register_node() Davidlohr Bueso
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Davidlohr Bueso @ 2022-04-16  5:38 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, dave, linux-kernel

Hello,

With the increasing popularity of memory tiering, the idea of this series is to trigger
some discussion around David's[1] system-wide proactive reclaim beyond memcg[2], as well
as around sysfs as the interface for exporting system-wide tiering information[2]. I am
hoping this can be discussed at LSFMM, and while I know many are interested in tiering
subjects in general, I have not seen anyone bring this up on the list.

There has been some initial discussion about the need to expose system-wide tiering
information to userspace. I thought I'd start with two sysfs files, added as node
attributes, that export the node's demotion path and whether or not the node is fast
memory. This was considered (and I agree) better than a new
/sys/devices/system/tier/tierN/ interface. So, are we going to go this route? If so,
what further information is useful for users? Would a
/sys/devices/system/node/nodeN/reclaim/ directory make sense instead?
  
Applies against Linus' current tree and has only been _gently_ tested.

Thanks!

Davidlohr Bueso (6):
  drivers/base/node: cleanup register_node()
  mm/vmscan: use node_is_toptier helper in node_reclaim
  mm: make __node_reclaim() more flexible
  mm: introduce per-node proactive reclaim interface
  mm/migration: export demotion_path of a node via sysfs
  mm/migrate: export whether or not tier is toptier in sysfs

 Documentation/ABI/stable/sysfs-devices-node |  22 ++++
 drivers/base/node.c                         |  68 ++++++++++--
 include/linux/migrate.h                     |  15 +++
 include/linux/swap.h                        |  16 +++
 include/trace/events/vmscan.h               |  12 +--
 mm/migrate.c                                |  15 +--
 mm/vmscan.c                                 | 108 +++++++++++++++-----
 7 files changed, 206 insertions(+), 50 deletions(-)

--
2.26.2



* [PATCH 1/6] drivers/base/node: cleanup register_node()
  2022-04-16  5:38 [PATCH RFC lsfmm 0/6] mm: proactive reclaim and memory tiering topics Davidlohr Bueso
@ 2022-04-16  5:38 ` Davidlohr Bueso
  2022-04-25 22:30   ` Adam Manzanares
                     ` (2 more replies)
  2022-04-16  5:38 ` [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim Davidlohr Bueso
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 25+ messages in thread
From: Davidlohr Bueso @ 2022-04-16  5:38 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, dave, linux-kernel

Trivially get rid of some unnecessary indentation.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
 drivers/base/node.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ec8bb24a5a22..6cdf25fd26c3 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -661,16 +661,16 @@ static int register_node(struct node *node, int num)
 	node->dev.bus = &node_subsys;
 	node->dev.release = node_device_release;
 	node->dev.groups = node_dev_groups;
-	error = device_register(&node->dev);
 
-	if (error)
+	error = device_register(&node->dev);
+	if (error) {
 		put_device(&node->dev);
-	else {
-		hugetlb_register_node(node);
-
-		compaction_register_node(node);
+		return error;
 	}
-	return error;
+
+	hugetlb_register_node(node);
+	compaction_register_node(node);
+	return 0;
 }
 
 /**
-- 
2.26.2



* [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim
  2022-04-16  5:38 [PATCH RFC lsfmm 0/6] mm: proactive reclaim and memory tiering topics Davidlohr Bueso
  2022-04-16  5:38 ` [PATCH 1/6] drivers/base/node: cleanup register_node() Davidlohr Bueso
@ 2022-04-16  5:38 ` Davidlohr Bueso
  2022-04-25 22:32   ` Adam Manzanares
                     ` (3 more replies)
  2022-04-16  5:38 ` [PATCH 3/6] mm: make __node_reclaim() more flexible Davidlohr Bueso
                   ` (3 subsequent siblings)
  5 siblings, 4 replies; 25+ messages in thread
From: Davidlohr Bueso @ 2022-04-16  5:38 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, dave, linux-kernel

We have helpers for a reason.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1678802e03e7..cb583fcbf5bf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4750,7 +4750,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	 * over remote processors and spread off node memory allocations
 	 * as wide as possible.
 	 */
-	if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
+	if (node_is_toptier(pgdat->node_id) && pgdat->node_id != numa_node_id())
 		return NODE_RECLAIM_NOSCAN;
 
 	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
-- 
2.26.2



* [PATCH 3/6] mm: make __node_reclaim() more flexible
  2022-04-16  5:38 [PATCH RFC lsfmm 0/6] mm: proactive reclaim and memory tiering topics Davidlohr Bueso
  2022-04-16  5:38 ` [PATCH 1/6] drivers/base/node: cleanup register_node() Davidlohr Bueso
  2022-04-16  5:38 ` [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim Davidlohr Bueso
@ 2022-04-16  5:38 ` Davidlohr Bueso
  2022-04-16  5:39 ` [PATCH 4/6] mm: introduce per-node proactive reclaim interface Davidlohr Bueso
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Davidlohr Bueso @ 2022-04-16  5:38 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, dave, linux-kernel

Currently __node_reclaim() is tailored to the allocator paths. With
proactive per-node reclaim it needs to be a bit more flexible:

 - Deal in terms of nr_pages instead of order. Similarly this also
   applies to the respective tracing.
 - Make the caller pass an already armed scan control.
 - Return the number of reclaimed pages. The caller can trivially
   compare this against its own target instead.

The current node_reclaim() interface remains the same.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
 include/trace/events/vmscan.h | 12 ++++-----
 mm/vmscan.c                   | 47 +++++++++++++++++++----------------
 2 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index de136dbd623a..ab6ce8d8770b 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -439,25 +439,25 @@ TRACE_EVENT(mm_vmscan_lru_shrink_active,
 
 TRACE_EVENT(mm_vmscan_node_reclaim_begin,
 
-	TP_PROTO(int nid, int order, gfp_t gfp_flags),
+	TP_PROTO(int nid, unsigned long nr_pages, gfp_t gfp_flags),
 
-	TP_ARGS(nid, order, gfp_flags),
+	TP_ARGS(nid, nr_pages, gfp_flags),
 
 	TP_STRUCT__entry(
 		__field(int, nid)
-		__field(int, order)
+		__field(unsigned long, nr_pages)
 		__field(gfp_t, gfp_flags)
 	),
 
 	TP_fast_assign(
 		__entry->nid = nid;
-		__entry->order = order;
+		__entry->nr_pages = nr_pages;
 		__entry->gfp_flags = gfp_flags;
 	),
 
-	TP_printk("nid=%d order=%d gfp_flags=%s",
+	TP_printk("nid=%d nr_pages=%lu gfp_flags=%s",
 		__entry->nid,
-		__entry->order,
+		__entry->nr_pages,
 		show_gfp_flags(__entry->gfp_flags))
 );
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cb583fcbf5bf..1735c302831c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4668,36 +4668,28 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
 
 /*
  * Try to free up some pages from this node through reclaim.
+ * Returns the number of reclaimed pages.
  */
-static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
+static unsigned long __node_reclaim(struct pglist_data *pgdat,
+				    gfp_t gfp_mask, unsigned long nr_pages,
+				    struct scan_control *sc)
 {
 	/* Minimum pages needed in order to stay on node */
-	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	unsigned int noreclaim_flag;
-	struct scan_control sc = {
-		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = current_gfp_context(gfp_mask),
-		.order = order,
-		.priority = NODE_RECLAIM_PRIORITY,
-		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
-		.may_swap = 1,
-		.reclaim_idx = gfp_zone(gfp_mask),
-	};
 	unsigned long pflags;
 
-	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
-					   sc.gfp_mask);
+	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, nr_pages,
+					   sc->gfp_mask);
 
 	cond_resched();
 	psi_memstall_enter(&pflags);
-	fs_reclaim_acquire(sc.gfp_mask);
+	fs_reclaim_acquire(sc->gfp_mask);
 	/*
 	 * We need to be able to allocate from the reserves for RECLAIM_UNMAP
 	 */
 	noreclaim_flag = memalloc_noreclaim_save();
-	set_task_reclaim_state(p, &sc.reclaim_state);
+	set_task_reclaim_state(p, &sc->reclaim_state);
 
 	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
 		/*
@@ -4705,23 +4697,34 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_node(pgdat, &sc);
-		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
+			shrink_node(pgdat, sc);
+		} while (sc->nr_reclaimed < nr_pages && --sc->priority >= 0);
 	}
 
 	set_task_reclaim_state(p, NULL);
 	memalloc_noreclaim_restore(noreclaim_flag);
-	fs_reclaim_release(sc.gfp_mask);
+	fs_reclaim_release(sc->gfp_mask);
 	psi_memstall_leave(&pflags);
 
-	trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed);
+	trace_mm_vmscan_node_reclaim_end(sc->nr_reclaimed);
 
-	return sc.nr_reclaimed >= nr_pages;
+	return sc->nr_reclaimed;
 }
 
 int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 {
 	int ret;
+	const unsigned long nr_pages = 1 << order;
+	struct scan_control sc = {
+		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
+		.gfp_mask = current_gfp_context(gfp_mask),
+		.order = order,
+		.priority = NODE_RECLAIM_PRIORITY,
+		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
+		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
+		.may_swap = 1,
+		.reclaim_idx = gfp_zone(gfp_mask),
+	};
 
 	/*
 	 * Node reclaim reclaims unmapped file backed pages and
@@ -4756,7 +4759,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
 		return NODE_RECLAIM_NOSCAN;
 
-	ret = __node_reclaim(pgdat, gfp_mask, order);
+	ret = __node_reclaim(pgdat, gfp_mask, nr_pages, &sc) >= nr_pages;
 	clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
 
 	if (!ret)
-- 
2.26.2
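
As a reading aid for the refactor above, here is a minimal userspace sketch (not kernel code) of the arithmetic that node_reclaim() keeps doing on behalf of the allocator after this patch: turn an order into a page count, clamp the reclaim target, and compare the returned count against the request. The helper names and the SKETCH_ constant are made up for illustration; the value 32 is assumed to match the kernel's SWAP_CLUSTER_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assumption: mirrors the kernel's SWAP_CLUSTER_MAX (32 pages). */
#define SKETCH_SWAP_CLUSTER_MAX 32UL

/* Pages requested by an allocation of the given order (1 << order). */
static unsigned long order_to_nr_pages(unsigned int order)
{
	/* Shifting by the full word width or more would be undefined. */
	assert(order < sizeof(unsigned long) * CHAR_BIT);
	return 1UL << order;
}

/* Reclaim target the wrapper stores in its scan control. */
static unsigned long reclaim_target(unsigned long nr_pages)
{
	return nr_pages > SKETCH_SWAP_CLUSTER_MAX ?
	       nr_pages : SKETCH_SWAP_CLUSTER_MAX;
}

/* The success check the wrapper now performs on the returned count. */
static bool reclaim_succeeded(unsigned long nr_reclaimed,
			      unsigned long nr_pages)
{
	return nr_reclaimed >= nr_pages;
}
```

A caller that thinks in pages rather than orders, such as the per-node interface in the next patch, would skip order_to_nr_pages() and pass its own count straight through.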



* [PATCH 4/6] mm: introduce per-node proactive reclaim interface
  2022-04-16  5:38 [PATCH RFC lsfmm 0/6] mm: proactive reclaim and memory tiering topics Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2022-04-16  5:38 ` [PATCH 3/6] mm: make __node_reclaim() more flexible Davidlohr Bueso
@ 2022-04-16  5:39 ` Davidlohr Bueso
  2022-04-19  0:00   ` Tim Chen
  2022-04-16  5:39 ` [PATCH 5/6] mm/migration: export demotion_path of a node via sysfs Davidlohr Bueso
  2022-04-17  3:49 ` [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf Davidlohr Bueso
  5 siblings, 1 reply; 25+ messages in thread
From: Davidlohr Bueso @ 2022-04-16  5:39 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, dave, linux-kernel

This patch introduces a mechanism to trigger memory reclaim
as a per-node sysfs interface, inspired by compaction's
equivalent; ie:

	 echo 1G > /sys/devices/system/node/nodeX/reclaim

It is based on the discussions from David's thread[1] as
well as the current upstreaming of the memcg[2] interface
(which has nice explanations for the benefits of userspace
reclaim overall). In both cases the conclusion was that either
way of inducing proactive reclaim should be KISS, and can be
extended later. So this patch does not allow the user much
fine tuning beyond the size of the reclaim, such as anon/file
selection or the semantics of demotion.

[1] https://lore.kernel.org/all/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
[2] https://lore.kernel.org/all/20220408045743.1432968-1-yosryahmed@google.com/
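
For tooling that would rather not shell out to echo, the write above could be issued from C roughly as follows. This is only a sketch under the assumption that the reclaim file proposed in this RFC exists; request_reclaim() is a hypothetical helper name, and only standard POSIX calls are used.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a size string such as "1G" to a reclaim file.
 * Returns 0 on success or a negative errno value on failure.
 */
static int request_reclaim(const char *path, const char *amount)
{
	size_t len;
	ssize_t ret;
	int fd, err = 0;

	assert(path && amount);
	len = strlen(amount);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	ret = write(fd, amount, len);
	if (ret < 0)
		err = -errno;		/* e.g. -EAGAIN from the store handler */
	else if ((size_t)ret != len)
		err = -EIO;		/* short write; treated as an error here */

	close(fd);
	return err;
}
```

With the semantics described in this patch, a return of -EAGAIN would mean that less than the requested amount was reclaimed, so a caller could decide whether to retry with the same or a smaller size.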

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
 Documentation/ABI/stable/sysfs-devices-node | 10 ++++
 drivers/base/node.c                         |  2 +
 include/linux/swap.h                        | 16 ++++++
 mm/vmscan.c                                 | 59 +++++++++++++++++++++
 4 files changed, 87 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 8db67aa472f1..3c935e1334f7 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -182,3 +182,13 @@ Date:		November 2021
 Contact:	Jarkko Sakkinen <jarkko@kernel.org>
 Description:
 		The total amount of SGX physical memory in bytes.
+
+What:		/sys/devices/system/node/nodeX/reclaim
+Date:		April 2022
+Contact:	Davidlohr Bueso <dave@stgolabs.net>
+Description:
+		Write the amount of bytes to induce memory reclaim in this node.
+		This file accepts a single key, the number of bytes to reclaim.
+		When it completes successfully, the specified amount or more memory
+		will have been reclaimed, and -EAGAIN if less bytes are reclaimed
+		than the specified amount.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 6cdf25fd26c3..d80c478e2a6e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -670,6 +670,7 @@ static int register_node(struct node *node, int num)
 
 	hugetlb_register_node(node);
 	compaction_register_node(node);
+	reclaim_register_node(node);
 	return 0;
 }
 
@@ -685,6 +686,7 @@ void unregister_node(struct node *node)
 	hugetlb_unregister_node(node);		/* no-op, if memoryless node */
 	node_remove_accesses(node);
 	node_remove_caches(node);
+	reclaim_unregister_node(node);
 	device_unregister(&node->dev);
 }
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 27093b477c5f..cca43ae6d770 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -398,6 +398,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 long remove_mapping(struct address_space *mapping, struct folio *folio);
 
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int reclaim_register_node(struct node *node);
+extern void reclaim_unregister_node(struct node *node);
+
+#else
+
+static inline int reclaim_register_node(struct node *node)
+{
+	return 0;
+}
+
+static inline void reclaim_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
+
 extern unsigned long reclaim_pages(struct list_head *page_list);
 #ifdef CONFIG_NUMA
 extern int node_reclaim_mode;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1735c302831c..3539f8a0f0ea 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4819,3 +4819,62 @@ void check_move_unevictable_pages(struct pagevec *pvec)
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+static ssize_t reclaim_store(struct device *dev,
+			     struct device_attribute *attr,
+			     const char *buf, size_t count)
+{
+	int err, nid = dev->id;
+	gfp_t gfp_mask = GFP_KERNEL;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
+	struct scan_control sc = {
+		.gfp_mask = current_gfp_context(gfp_mask),
+		.reclaim_idx = gfp_zone(gfp_mask),
+		.priority = NODE_RECLAIM_PRIORITY,
+		.may_writepage = !laptop_mode,
+		.may_unmap = 1,
+		.may_swap = 1,
+	};
+
+	buf = strstrip((char *)buf);
+	err = page_counter_memparse(buf, "", &nr_to_reclaim);
+	if (err)
+		return err;
+
+	sc.nr_to_reclaim = max(nr_to_reclaim, SWAP_CLUSTER_MAX);
+
+	while (nr_reclaimed < nr_to_reclaim) {
+		unsigned long reclaimed;
+
+		if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
+			return -EAGAIN;
+
+		/* does cond_resched() */
+		reclaimed = __node_reclaim(pgdat, gfp_mask,
+					   nr_to_reclaim - nr_reclaimed, &sc);
+
+		clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
+
+		if (!reclaimed && !nr_retries--)
+			break;
+
+		nr_reclaimed += reclaimed;
+	}
+
+	return nr_reclaimed < nr_to_reclaim ? -EAGAIN : count;
+}
+
+static DEVICE_ATTR_WO(reclaim);
+int reclaim_register_node(struct node *node)
+{
+	return device_create_file(&node->dev, &dev_attr_reclaim);
+}
+
+void reclaim_unregister_node(struct node *node)
+{
+	device_remove_file(&node->dev, &dev_attr_reclaim);
+}
+#endif
-- 
2.26.2
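
To make the size handling in reclaim_store() easier to follow, here is a hedged userspace approximation of what parsing a string like "1G" into a page count involves. It is not the kernel's page_counter_memparse(); size_to_pages() is a hypothetical name, the page size is passed in by the caller, and only the K/M/G/T suffixes are handled.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a size such as "4096", "512K" or "1G" into whole pages.
 * Returns 0 on success and -EINVAL on malformed or overflowing input.
 */
static int size_to_pages(const char *str, unsigned long page_size,
			 unsigned long long *nr_pages)
{
	unsigned long long bytes;
	unsigned int shift = 0;
	char *end;

	assert(str && nr_pages && page_size > 0);

	/* Reject signs, leading whitespace and empty strings up front. */
	if (!isdigit((unsigned char)*str))
		return -EINVAL;

	errno = 0;
	bytes = strtoull(str, &end, 0);
	if (errno)
		return -EINVAL;

	switch (toupper((unsigned char)*end)) {
	case 'T':
		shift = 40;
		break;
	case 'G':
		shift = 30;
		break;
	case 'M':
		shift = 20;
		break;
	case 'K':
		shift = 10;
		break;
	}
	if (shift)
		end++;

	/* Anything left after the optional suffix is an error. */
	if (*end != '\0')
		return -EINVAL;
	if (shift && bytes > (~0ULL >> shift))
		return -EINVAL;

	/* Round down to whole pages, as a byte-to-page conversion would. */
	*nr_pages = (bytes << shift) / page_size;
	return 0;
}
```

Note that the sketch does not strip a trailing newline the way the handler above does with strstrip(), so a caller reading from a shell-style input would need to trim it first.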



* [PATCH 5/6] mm/migration: export demotion_path of a node via sysfs
  2022-04-16  5:38 [PATCH RFC lsfmm 0/6] mm: proactive reclaim and memory tiering topics Davidlohr Bueso
                   ` (3 preceding siblings ...)
  2022-04-16  5:39 ` [PATCH 4/6] mm: introduce per-node proactive reclaim interface Davidlohr Bueso
@ 2022-04-16  5:39 ` Davidlohr Bueso
  2022-04-22 17:31   ` Yang Shi
  2022-04-17  3:49 ` [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf Davidlohr Bueso
  5 siblings, 1 reply; 25+ messages in thread
From: Davidlohr Bueso @ 2022-04-16  5:39 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, dave, linux-kernel

Add a /sys/devices/system/node/nodeX/demotion_path file
to export the possible target(s) in node_demotion[node].

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
 Documentation/ABI/stable/sysfs-devices-node |  6 ++++
 drivers/base/node.c                         | 39 +++++++++++++++++++++
 include/linux/migrate.h                     | 15 ++++++++
 mm/migrate.c                                | 15 +-------
 4 files changed, 61 insertions(+), 14 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 3c935e1334f7..f620c6ae013c 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -192,3 +192,9 @@ Description:
 		When it completes successfully, the specified amount or more memory
 		will have been reclaimed, and -EAGAIN if less bytes are reclaimed
 		than the specified amount.
+
+What:		/sys/devices/system/node/nodeX/demotion_path
+Date:		April 2022
+Contact:	Davidlohr Bueso <dave@stgolabs.net>
+Description:
+		Shows nodes within the next tier of slower memory below this node.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index d80c478e2a6e..ab4bae777535 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -17,6 +17,7 @@
 #include <linux/nodemask.h>
 #include <linux/cpu.h>
 #include <linux/device.h>
+#include <linux/migrate.h>
 #include <linux/pm_runtime.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
@@ -560,11 +561,49 @@ static ssize_t node_read_distance(struct device *dev,
 }
 static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
 
+static ssize_t node_read_demotion_path(struct device *dev,
+				       struct device_attribute *attr, char *buf)
+{
+	int nid = dev->id;
+	int len = 0;
+	int i;
+	struct demotion_nodes *nd;
+
+	/*
+	 * buf is currently PAGE_SIZE in length and each node needs 4 chars
+	 * at the most (target + space or newline).
+	 */
+	BUILD_BUG_ON(MAX_NUMNODES * 4 > PAGE_SIZE);
+
+	if (!node_demotion) {
+		len += sysfs_emit_at(buf, len, "%d", NUMA_NO_NODE);
+		goto done;
+	}
+
+	nd = &node_demotion[nid];
+
+	rcu_read_lock();
+	if (nd->nr == 0)
+		len += sysfs_emit_at(buf, len, "%d", NUMA_NO_NODE);
+	else {
+		for (i = 0; i < nd->nr; i++) {
+			len += sysfs_emit_at(buf, len, "%s%d",
+					     i ? " " : "", nd->nodes[i]);
+		}
+	}
+	rcu_read_unlock();
+done:
+	len += sysfs_emit_at(buf, len, "\n");
+	return len;
+}
+static DEVICE_ATTR(demotion_path, 0444, node_read_demotion_path, NULL);
+
 static struct attribute *node_dev_attrs[] = {
 	&dev_attr_meminfo.attr,
 	&dev_attr_numastat.attr,
 	&dev_attr_distance.attr,
 	&dev_attr_vmstat.attr,
+	&dev_attr_demotion_path.attr,
 	NULL
 };
 
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 90e75d5a54d6..b0ac6a717e44 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -111,6 +111,21 @@ static inline int migrate_misplaced_page(struct page *page,
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+#define DEFAULT_DEMOTION_TARGET_NODES 15
+
+#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
+#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
+#else
+#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
+#endif
+
+struct demotion_nodes {
+	unsigned short nr;
+	short nodes[DEMOTION_TARGET_NODES];
+};
+
+extern struct demotion_nodes *node_demotion __read_mostly;
+
 #ifdef CONFIG_MIGRATION
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c31ee1e1c9b..e47ea25fcfe8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2172,20 +2172,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
  * must be held over all reads to ensure that no cycles are
  * observed.
  */
-#define DEFAULT_DEMOTION_TARGET_NODES 15
-
-#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
-#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
-#else
-#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
-#endif
-
-struct demotion_nodes {
-	unsigned short nr;
-	short nodes[DEMOTION_TARGET_NODES];
-};
-
-static struct demotion_nodes *node_demotion __read_mostly;
+struct demotion_nodes *node_demotion __read_mostly;
 
 /**
  * next_demotion_node() - Get the next node in the demotion path
-- 
2.26.2
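
The output format of the demotion_path file above (space-separated target nodes, or -1 when there are none, followed by a newline) can be reproduced in userspace for testing parsers. The following is a sketch using snprintf() rather than the kernel's sysfs_emit_at(); format_demotion_path() and SKETCH_NO_NODE are illustrative names, with -1 assumed to match NUMA_NO_NODE.

```c
#include <assert.h>
#include <stdio.h>

#define SKETCH_NO_NODE (-1)	/* stands in for the kernel's NUMA_NO_NODE */

/*
 * Format demotion targets as "a b c\n", or "-1\n" when there are none.
 * Returns the number of characters written (excluding the NUL), or -1
 * if the buffer is too small.
 */
static int format_demotion_path(const short *nodes, unsigned short nr,
				char *buf, size_t size)
{
	size_t len = 0;
	unsigned short i;
	int n;

	assert(buf && (nodes || nr == 0));

	if (nr == 0) {
		n = snprintf(buf, size, "%d\n", SKETCH_NO_NODE);
		return (n < 0 || (size_t)n >= size) ? -1 : n;
	}

	for (i = 0; i < nr; i++) {
		/* Space before every entry but the first, newline after the last. */
		n = snprintf(buf + len, size - len, "%s%d%s",
			     i ? " " : "", nodes[i],
			     i == nr - 1 ? "\n" : "");
		if (n < 0 || (size_t)n >= size - len)
			return -1;
		len += (size_t)n;
	}
	return (int)len;
}
```

Unlike the kernel version, which relies on the sysfs buffer being PAGE_SIZE long, this sketch checks the remaining space on every write because the caller chooses the buffer.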



* [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf
  2022-04-16  5:38 [PATCH RFC lsfmm 0/6] mm: proactive reclaim and memory tiering topics Davidlohr Bueso
                   ` (4 preceding siblings ...)
  2022-04-16  5:39 ` [PATCH 5/6] mm/migration: export demotion_path of a node via sysfs Davidlohr Bueso
@ 2022-04-17  3:49 ` Davidlohr Bueso
  2022-04-18 15:34   ` Dave Hansen
  2022-04-22 17:37   ` Yang Shi
  5 siblings, 2 replies; 25+ messages in thread
From: Davidlohr Bueso @ 2022-04-17  3:49 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, linux-kernel



This allows userspace to know if the node is considered fast
memory (with CPUs attached to it). While this can be already
derived without a new file, this helps further encapsulate the
concept.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
Resending, just noticed this patch was never posted.

  Documentation/ABI/stable/sysfs-devices-node |  6 ++++++
  drivers/base/node.c                         | 13 +++++++++++++
  2 files changed, 19 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index f620c6ae013c..1c21c3985535 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -198,3 +198,9 @@ Date:		April 2022
  Contact:	Davidlohr Bueso <dave@stgolabs.net>
  Description:
		Shows nodes within the next tier of slower memory below this node.
+
+What:		/sys/devices/system/node/nodeX/memory_toptier
+Date:		April 2022
+Contact:	Davidlohr Bueso <dave@stgolabs.net>
+Description:
+		Node is attached to fast memory or not.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index ab4bae777535..b9de5b0360f2 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -598,12 +598,25 @@ static ssize_t node_read_demotion_path(struct device *dev,
  }
  static DEVICE_ATTR(demotion_path, 0444, node_read_demotion_path, NULL);

+static ssize_t node_read_memory_toptier(struct device *dev,
+				     struct device_attribute *attr, char *buf)
+{
+	int nid = dev->id;
+	int len = 0;
+
+	len += sysfs_emit_at(buf, len, "%d\n", !!node_is_toptier(nid));
+
+	return len;
+}
+static DEVICE_ATTR(memory_toptier, 0444, node_read_memory_toptier, NULL);
+
  static struct attribute *node_dev_attrs[] = {
	&dev_attr_meminfo.attr,
	&dev_attr_numastat.attr,
	&dev_attr_distance.attr,
	&dev_attr_vmstat.attr,
	&dev_attr_demotion_path.attr,
+	&dev_attr_memory_toptier.attr,
	NULL
  };

--
2.26.2
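
The commit message notes that the same information can already be derived without a new file. One way userspace might do that, assuming top tier simply means "node has CPUs" and that the existing /sys/devices/system/node/has_cpu file is read into a string first, is to test membership in its range list. nodelist_contains() is a hypothetical helper written for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Return true if node id `nid` appears in a kernel-style list string
 * such as "0-1,3\n". Malformed input is treated as "not found".
 */
static bool nodelist_contains(const char *list, long nid)
{
	const char *p = list;
	char *end;

	assert(list);

	while (*p && *p != '\n') {
		long lo, hi;

		lo = strtol(p, &end, 10);
		if (end == p)
			return false;	/* no digits where a number was expected */
		hi = lo;
		p = end;

		if (*p == '-') {	/* range "lo-hi" */
			p++;
			hi = strtol(p, &end, 10);
			if (end == p)
				return false;
			p = end;
		}

		if (nid >= lo && nid <= hi)
			return true;

		if (*p == ',')
			p++;
		else if (*p && *p != '\n')
			return false;	/* unexpected separator */
	}
	return false;
}
```

A script-level equivalent is only a few lines as well, which is part of the redundancy concern raised in the replies below.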


* Re: [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf
  2022-04-17  3:49 ` [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf Davidlohr Bueso
@ 2022-04-18 15:34   ` Dave Hansen
  2022-04-18 16:45     ` Davidlohr Bueso
  2022-04-22 17:37   ` Yang Shi
  1 sibling, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2022-04-18 15:34 UTC (permalink / raw)
  To: linux-mm, mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, linux-kernel

On 4/16/22 20:49, Davidlohr Bueso wrote:
> This allows userspace to know if the node is considered fast
> memory (with CPUs attached to it). While this can be already
> derived without a new file, this helps further encapsulate the
> concept.

What is userspace supposed to *do* with this, though?

What does "attached" mean?

Isn't it just asking for trouble to add (known) redundancy to the ABI?
It seems like a recipe for future inconsistency.


* Re: [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf
  2022-04-18 15:34   ` Dave Hansen
@ 2022-04-18 16:45     ` Davidlohr Bueso
  2022-04-18 16:50       ` Dave Hansen
  0 siblings, 1 reply; 25+ messages in thread
From: Davidlohr Bueso @ 2022-04-18 16:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, linux-kernel

On Mon, 18 Apr 2022, Dave Hansen wrote:

>On 4/16/22 20:49, Davidlohr Bueso wrote:
>> This allows userspace to know if the node is considered fast
>> memory (with CPUs attached to it). While this can be already
>> derived without a new file, this helps further encapsulate the
>> concept.
>
>What is userspace supposed to *do* with this, though?

This came from scratching my own itch. I wanted to start testing
more tiering patches overall that I see pop up, and wanted a way
to differentiate the slow vs the fast memories in order to better
configure workload(s) working set sizes beyond what is your typical
grep MemTotal /proc/meminfo. If there is a better way I'm all
for it.

>
>What does "attached" mean?

I'll rephrase.

>Isn't it just asking for trouble to add (known) redundancy to the ABI?
>It seems like a recipe for future inconsistency.

Perhaps. It was mostly about the fact that the notion of top tier
could also change as technology evolves.

Thanks,
Davidlohr


* Re: [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf
  2022-04-18 16:45     ` Davidlohr Bueso
@ 2022-04-18 16:50       ` Dave Hansen
  2022-04-18 17:01         ` Davidlohr Bueso
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Hansen @ 2022-04-18 16:50 UTC (permalink / raw)
  To: linux-mm, mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, linux-kernel

On 4/18/22 09:45, Davidlohr Bueso wrote:
> On Mon, 18 Apr 2022, Dave Hansen wrote:
>> On 4/16/22 20:49, Davidlohr Bueso wrote:
>>> This allows userspace to know if the node is considered fast
>>> memory (with CPUs attached to it). While this can be already
>>> derived without a new file, this helps further encapsulate the
>>> concept.
>>
>> What is userspace supposed to *do* with this, though?
> 
> This came from scratching my own itch. I wanted to start testing
> more tiering patches overall that I see pop up, and wanted a way
> to differentiate the slow vs the fast memories in order to better
> configure workload(s) working set sizes beyond what is your typical
> grep MemTotal /proc/meminfo. If there is a better way I'm all
> for it.

But how does this help you?  Does it save you a few lines in a shell
script to find the nodes that have memory and CPUs?

>> Isn't it just asking for trouble to add (known) redundancy to the ABI?
>> It seems like a recipe for future inconsistency.
> 
> Perhaps. It was mostly about the fact that the notion of top tier
> could also change as technology evolves.

It seems like something arbitrary that everyone will just disagree on.
I think we should try to stick to cold, hard facts as much as possible
rather than trying to have the *kernel* dictate as a policy what is fast
versus slow.


* Re: [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf
  2022-04-18 16:50       ` Dave Hansen
@ 2022-04-18 17:01         ` Davidlohr Bueso
  0 siblings, 0 replies; 25+ messages in thread
From: Davidlohr Bueso @ 2022-04-18 17:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, linux-kernel

On Mon, 18 Apr 2022, Dave Hansen wrote:

>I think we should try to stick to cold, hard facts as much as possible
>rather than trying to have the *kernel* dictate as a policy what is fast
>versus slow.

That's a very good point and I agree.

Thanks,
Davidlohr


* Re: [PATCH 4/6] mm: introduce per-node proactive reclaim interface
  2022-04-16  5:39 ` [PATCH 4/6] mm: introduce per-node proactive reclaim interface Davidlohr Bueso
@ 2022-04-19  0:00   ` Tim Chen
  0 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2022-04-19  0:00 UTC (permalink / raw)
  To: Davidlohr Bueso, linux-mm
  Cc: mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, roman.gushchin, gthelen, a.manzanares, heekwon.p,
	gim.jongmin, linux-kernel

On Fri, 2022-04-15 at 22:39 -0700, Davidlohr Bueso wrote:
> This patch introduces a mechanism to trigger memory reclaim
> as a per-node sysfs interface, inspired by compaction's
> equivalent; ie:
> 
> 	 echo 1G > /sys/devices/system/node/nodeX/reclaim
> 

I think it would be more flexible to specify a node mask
as a parameter, along with the amount of memory, in the
memory.reclaim memcg interface proposed by Yosry.  Doing it node
by node is more cumbersome.  It is just a special case
of reclaiming from the root cgroup for a specific node.

Wei Gu, Ying and I have had some discussions on this:
https://lore.kernel.org/all/df6110a09cacc80ee1cbe905a71273a5f3953e16.camel@linux.intel.com/

 
Tim

> It is based on the discussions from David's thread[1] as
> well as the current upstreaming of the memcg[2] interface
> (which has nice explanations for the benefits of userspace
> reclaim overall). In both cases conclusions were that either
> way of inducing proactive reclaim should be KISS, and can be
> later extended. So this patch does not allow the user much
> fine tuning beyond the size of the reclaim, such as anon/file
> or whether or semantics of demotion.
> 
> [1] https://lore.kernel.org/all/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
> [2] https://lore.kernel.org/all/20220408045743.1432968-1-yosryahmed@google.com/
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
>  Documentation/ABI/stable/sysfs-devices-node | 10 ++++
>  drivers/base/node.c                         |  2 +
>  include/linux/swap.h                        | 16 ++++++
>  mm/vmscan.c                                 | 59 +++++++++++++++++++++
>  4 files changed, 87 insertions(+)
> 
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 8db67aa472f1..3c935e1334f7 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -182,3 +182,13 @@ Date:		November 2021
>  Contact:	Jarkko Sakkinen <jarkko@kernel.org>
>  Description:
>  		The total amount of SGX physical memory in bytes.
> +
> +What:		/sys/devices/system/node/nodeX/reclaim
> +Date:		April 2022
> +Contact:	Davidlohr Bueso <dave@stgolabs.net>
> +Description:
> +		Write the amount of bytes to induce memory reclaim in this node.
> +		This file accepts a single key, the number of bytes to reclaim.
> +		When it completes successfully, the specified amount or more memory
> +		will have been reclaimed, and -EAGAIN if less bytes are reclaimed
> +		than the specified amount.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 6cdf25fd26c3..d80c478e2a6e 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -670,6 +670,7 @@ static int register_node(struct node *node, int num)
>  
>  	hugetlb_register_node(node);
>  	compaction_register_node(node);
> +	reclaim_register_node(node);
>  	return 0;
>  }
>  
> @@ -685,6 +686,7 @@ void unregister_node(struct node *node)
>  	hugetlb_unregister_node(node);		/* no-op, if memoryless node */
>  	node_remove_accesses(node);
>  	node_remove_caches(node);
> +	reclaim_unregister_node(node);
>  	device_unregister(&node->dev);
>  }
>  
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 27093b477c5f..cca43ae6d770 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -398,6 +398,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
>  long remove_mapping(struct address_space *mapping, struct folio *folio);
>  
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +extern int reclaim_register_node(struct node *node);
> +extern void reclaim_unregister_node(struct node *node);
> +
> +#else
> +
> +static inline int reclaim_register_node(struct node *node)
> +{
> +	return 0;
> +}
> +
> +static inline void reclaim_unregister_node(struct node *node)
> +{
> +}
> +#endif /* CONFIG_SYSFS && CONFIG_NUMA */
> +
>  extern unsigned long reclaim_pages(struct list_head *page_list);
>  #ifdef CONFIG_NUMA
>  extern int node_reclaim_mode;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1735c302831c..3539f8a0f0ea 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4819,3 +4819,62 @@ void check_move_unevictable_pages(struct pagevec *pvec)
>  	}
>  }
>  EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
> +
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +static ssize_t reclaim_store(struct device *dev,
> +			     struct device_attribute *attr,
> +			     const char *buf, size_t count)
> +{
> +	int err, nid = dev->id;
> +	gfp_t gfp_mask = GFP_KERNEL;
> +	struct pglist_data *pgdat = NODE_DATA(nid);
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	struct scan_control sc = {
> +		.gfp_mask = current_gfp_context(gfp_mask),
> +		.reclaim_idx = gfp_zone(gfp_mask),
> +		.priority = NODE_RECLAIM_PRIORITY,
> +		.may_writepage = !laptop_mode,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +	};
> +
> +	buf = strstrip((char *)buf);
> +	err = page_counter_memparse(buf, "", &nr_to_reclaim);
> +	if (err)
> +		return err;
> +
> +	sc.nr_to_reclaim = max(nr_to_reclaim, SWAP_CLUSTER_MAX);
> +
> +	while (nr_reclaimed < nr_to_reclaim) {
> +		unsigned long reclaimed;
> +
> +		if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
> +			return -EAGAIN;
> +
> +		/* does cond_resched() */
> +		reclaimed = __node_reclaim(pgdat, gfp_mask,
> +					   nr_to_reclaim - nr_reclaimed, &sc);
> +
> +		clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
> +
> +		if (!reclaimed && !nr_retries--)
> +			break;
> +
> +		nr_reclaimed += reclaimed;
> +	}
> +
> +	return nr_reclaimed < nr_to_reclaim ? -EAGAIN : count;
> +}
> +
> +static DEVICE_ATTR_WO(reclaim);
> +int reclaim_register_node(struct node *node)
> +{
> +	return device_create_file(&node->dev, &dev_attr_reclaim);
> +}
> +
> +void reclaim_unregister_node(struct node *node)
> +{
> +	device_remove_file(&node->dev, &dev_attr_reclaim);
> +}
> +#endif
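
For readers following the retry semantics of the quoted reclaim_store(), here is a minimal userspace model of its loop, not kernel code. The names model_reclaim(), reclaim_step_fn and pool_step() are hypothetical, and the retry budget of 16 is an assumed value standing in for MAX_RECLAIM_RETRIES. The point it illustrates: the retry counter is only consumed by passes that reclaim nothing, and the caller sees -EAGAIN whenever the total falls short of the request.

```c
#include <errno.h>
#include <stddef.h>

/* Assumed retry budget; stands in for the kernel's MAX_RECLAIM_RETRIES. */
#define MODEL_MAX_RETRIES 16

/* Hypothetical stand-in for __node_reclaim(): pages reclaimed in one pass. */
typedef unsigned long (*reclaim_step_fn)(unsigned long want, void *ctx);

/*
 * Mirrors the control flow of the quoted loop: keep calling the step until
 * the target is met, and give up once the step has returned zero more often
 * than the retry budget allows. Returns 0 on success, -EAGAIN if short.
 */
static long model_reclaim(unsigned long target, reclaim_step_fn step,
			  void *ctx, unsigned long *reclaimed_out)
{
	unsigned long done = 0;
	unsigned int retries = MODEL_MAX_RETRIES;

	while (done < target) {
		unsigned long got = step(target - done, ctx);

		/* Only an empty pass consumes a retry. */
		if (!got && !retries--)
			break;
		done += got;
	}

	if (reclaimed_out)
		*reclaimed_out = done;
	return done < target ? -EAGAIN : 0;
}

/* Example step: hands out pages from a finite pool, at most 32 per call. */
static unsigned long pool_step(unsigned long want, void *ctx)
{
	unsigned long *pool = ctx;
	unsigned long got = want < 32 ? want : 32;

	if (got > *pool)
		got = *pool;
	*pool -= got;
	return got;
}
```

With a pool of 100 pages a request for 64 succeeds in two passes. With a pool of 10 the model reclaims the 10 pages, burns through the retry budget on empty passes, and reports -EAGAIN even though some memory was freed.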


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/6] mm/migration: export demotion_path of a node via sysfs
  2022-04-16  5:39 ` [PATCH 5/6] mm/migration: export demotion_path of a node via sysfs Davidlohr Bueso
@ 2022-04-22 17:31   ` Yang Shi
  2022-04-22 17:33     ` Yang Shi
  0 siblings, 1 reply; 25+ messages in thread
From: Yang Shi @ 2022-04-22 17:31 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-mm, mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, linux-kernel

On Fri, Apr 15, 2022 at 10:39 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> Add a /sys/devices/system/node/nodeX/demotion_path file
> to export the possible target(s) in node_demotion[node].

I'm not sure if you noticed that Jagdish Gediya is working on a
similar patch; please see
https://lore.kernel.org/linux-mm/20220413092206.73974-1-jvgediya@linux.ibm.com/

It would be better to combine the two to avoid duplicate effort.

>
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
>  Documentation/ABI/stable/sysfs-devices-node |  6 ++++
>  drivers/base/node.c                         | 39 +++++++++++++++++++++
>  include/linux/migrate.h                     | 15 ++++++++
>  mm/migrate.c                                | 15 +-------
>  4 files changed, 61 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 3c935e1334f7..f620c6ae013c 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -192,3 +192,9 @@ Description:
>                 When it completes successfully, the specified amount or more memory
>                 will have been reclaimed, and -EAGAIN if less bytes are reclaimed
>                 than the specified amount.
> +
> +What:          /sys/devices/system/node/nodeX/demotion_path
> +Date:          April 2022
> +Contact:       Davidlohr Bueso <dave@stgolabs.net>
> +Description:
> +               Shows nodes within the next tier of slower memory below this node.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index d80c478e2a6e..ab4bae777535 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -17,6 +17,7 @@
>  #include <linux/nodemask.h>
>  #include <linux/cpu.h>
>  #include <linux/device.h>
> +#include <linux/migrate.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/swap.h>
>  #include <linux/slab.h>
> @@ -560,11 +561,49 @@ static ssize_t node_read_distance(struct device *dev,
>  }
>  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>
> +static ssize_t node_read_demotion_path(struct device *dev,
> +                                      struct device_attribute *attr, char *buf)
> +{
> +       int nid = dev->id;
> +       int len = 0;
> +       int i;
> +       struct demotion_nodes *nd;
> +
> +       /*
> +        * buf is currently PAGE_SIZE in length and each node needs 4 chars
> +        * at the most (target + space or newline).
> +        */
> +       BUILD_BUG_ON(MAX_NUMNODES * 4 > PAGE_SIZE);
> +
> +       if (!node_demotion) {
> +               len += sysfs_emit_at(buf, len, "%d", NUMA_NO_NODE);
> +               goto done;
> +       }
> +
> +       nd = &node_demotion[nid];
> +
> +       rcu_read_lock();
> +       if (nd->nr == 0)
> +               len += sysfs_emit_at(buf, len, "%d", NUMA_NO_NODE);
> +       else {
> +               for (i = 0; i < nd->nr; i++) {
> +                       len += sysfs_emit_at(buf, len, "%s%d",
> +                                            i ? " " : "", nd->nodes[i]);
> +               }
> +       }
> +       rcu_read_unlock();
> +done:
> +       len += sysfs_emit_at(buf, len, "\n");
> +       return len;
> +}
> +static DEVICE_ATTR(demotion_path, 0444, node_read_demotion_path, NULL);
> +
>  static struct attribute *node_dev_attrs[] = {
>         &dev_attr_meminfo.attr,
>         &dev_attr_numastat.attr,
>         &dev_attr_distance.attr,
>         &dev_attr_vmstat.attr,
> +       &dev_attr_demotion_path.attr,
>         NULL
>  };
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 90e75d5a54d6..b0ac6a717e44 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -111,6 +111,21 @@ static inline int migrate_misplaced_page(struct page *page,
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
>
> +#define DEFAULT_DEMOTION_TARGET_NODES 15
> +
> +#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
> +#define DEMOTION_TARGET_NODES  (MAX_NUMNODES - 1)
> +#else
> +#define DEMOTION_TARGET_NODES  DEFAULT_DEMOTION_TARGET_NODES
> +#endif
> +
> +struct demotion_nodes {
> +       unsigned short nr;
> +       short nodes[DEMOTION_TARGET_NODES];
> +};
> +
> +extern struct demotion_nodes *node_demotion __read_mostly;
> +
>  #ifdef CONFIG_MIGRATION
>
>  /*
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6c31ee1e1c9b..e47ea25fcfe8 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2172,20 +2172,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>   * must be held over all reads to ensure that no cycles are
>   * observed.
>   */
> -#define DEFAULT_DEMOTION_TARGET_NODES 15
> -
> -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
> -#define DEMOTION_TARGET_NODES  (MAX_NUMNODES - 1)
> -#else
> -#define DEMOTION_TARGET_NODES  DEFAULT_DEMOTION_TARGET_NODES
> -#endif
> -
> -struct demotion_nodes {
> -       unsigned short nr;
> -       short nodes[DEMOTION_TARGET_NODES];
> -};
> -
> -static struct demotion_nodes *node_demotion __read_mostly;
> +struct demotion_nodes *node_demotion __read_mostly;
>
>  /**
>   * next_demotion_node() - Get the next node in the demotion path
> --
> 2.26.2
>
>
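
As an illustration of what a consumer of the proposed demotion_path file would have to handle, here is a small userspace parsing sketch. It assumes only the output format visible in the quoted node_read_demotion_path(): space-separated node ids terminated by a newline, or -1 (NUMA_NO_NODE) when there is no target. The function name parse_demotion_path() is hypothetical.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse one line in the format emitted by the quoted sysfs handler.
 * Stores up to max node ids in out[] and returns how many were found,
 * 0 if the line reports no demotion target, or -1 on malformed input.
 */
static int parse_demotion_path(const char *line, int *out, int max)
{
	const char *p = line;
	int n = 0;

	for (;;) {
		char *end;
		long v;

		errno = 0;
		v = strtol(p, &end, 10);
		if (end == p)		/* no further number on the line */
			break;
		if (errno || v < -1 || v > INT_MAX)
			return -1;
		if (v == -1)		/* NUMA_NO_NODE: nothing to demote to */
			return 0;
		if (n == max)
			return -1;
		out[n++] = (int)v;
		p = end;
	}

	/* Only trailing separators may remain. */
	while (*p == ' ' || *p == '\n')
		p++;
	return *p ? -1 : n;
}
```

A caller would read the file into a buffer first and pass the buffer in. Note that the format itself was still under discussion in this thread, so any real parser should follow whatever ABI is eventually merged.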

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/6] mm/migration: export demotion_path of a node via sysfs
  2022-04-22 17:31   ` Yang Shi
@ 2022-04-22 17:33     ` Yang Shi
  2022-04-22 17:50       ` Davidlohr Bueso
  0 siblings, 1 reply; 25+ messages in thread
From: Yang Shi @ 2022-04-22 17:33 UTC (permalink / raw)
  To: Davidlohr Bueso, jvgediya, ying.huang, weixugc
  Cc: linux-mm, mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, linux-kernel

On Fri, Apr 22, 2022 at 10:31 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Apr 15, 2022 at 10:39 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
> >
> > Add a /sys/devices/system/node/nodeX/demotion_path file
> > to export the possible target(s) in node_demotion[node].
>
> I'm not sure if you noticed that Jagdish Gediya is working on a
> similar patch; please see
> https://lore.kernel.org/linux-mm/20220413092206.73974-1-jvgediya@linux.ibm.com/

Loop in Jagdish Gediya, Ying Huang and Wei Xu.

>
> It would be better to combine the two to avoid duplicate effort.
>
> >
> > Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> > ---
> >  Documentation/ABI/stable/sysfs-devices-node |  6 ++++
> >  drivers/base/node.c                         | 39 +++++++++++++++++++++
> >  include/linux/migrate.h                     | 15 ++++++++
> >  mm/migrate.c                                | 15 +-------
> >  4 files changed, 61 insertions(+), 14 deletions(-)
> >
> > diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> > index 3c935e1334f7..f620c6ae013c 100644
> > --- a/Documentation/ABI/stable/sysfs-devices-node
> > +++ b/Documentation/ABI/stable/sysfs-devices-node
> > @@ -192,3 +192,9 @@ Description:
> >                 When it completes successfully, the specified amount or more memory
> >                 will have been reclaimed, and -EAGAIN if less bytes are reclaimed
> >                 than the specified amount.
> > +
> > +What:          /sys/devices/system/node/nodeX/demotion_path
> > +Date:          April 2022
> > +Contact:       Davidlohr Bueso <dave@stgolabs.net>
> > +Description:
> > +               Shows nodes within the next tier of slower memory below this node.
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index d80c478e2a6e..ab4bae777535 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -17,6 +17,7 @@
> >  #include <linux/nodemask.h>
> >  #include <linux/cpu.h>
> >  #include <linux/device.h>
> > +#include <linux/migrate.h>
> >  #include <linux/pm_runtime.h>
> >  #include <linux/swap.h>
> >  #include <linux/slab.h>
> > @@ -560,11 +561,49 @@ static ssize_t node_read_distance(struct device *dev,
> >  }
> >  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
> >
> > +static ssize_t node_read_demotion_path(struct device *dev,
> > +                                      struct device_attribute *attr, char *buf)
> > +{
> > +       int nid = dev->id;
> > +       int len = 0;
> > +       int i;
> > +       struct demotion_nodes *nd;
> > +
> > +       /*
> > +        * buf is currently PAGE_SIZE in length and each node needs 4 chars
> > +        * at the most (target + space or newline).
> > +        */
> > +       BUILD_BUG_ON(MAX_NUMNODES * 4 > PAGE_SIZE);
> > +
> > +       if (!node_demotion) {
> > +               len += sysfs_emit_at(buf, len, "%d", NUMA_NO_NODE);
> > +               goto done;
> > +       }
> > +
> > +       nd = &node_demotion[nid];
> > +
> > +       rcu_read_lock();
> > +       if (nd->nr == 0)
> > +               len += sysfs_emit_at(buf, len, "%d", NUMA_NO_NODE);
> > +       else {
> > +               for (i = 0; i < nd->nr; i++) {
> > +                       len += sysfs_emit_at(buf, len, "%s%d",
> > +                                            i ? " " : "", nd->nodes[i]);
> > +               }
> > +       }
> > +       rcu_read_unlock();
> > +done:
> > +       len += sysfs_emit_at(buf, len, "\n");
> > +       return len;
> > +}
> > +static DEVICE_ATTR(demotion_path, 0444, node_read_demotion_path, NULL);
> > +
> >  static struct attribute *node_dev_attrs[] = {
> >         &dev_attr_meminfo.attr,
> >         &dev_attr_numastat.attr,
> >         &dev_attr_distance.attr,
> >         &dev_attr_vmstat.attr,
> > +       &dev_attr_demotion_path.attr,
> >         NULL
> >  };
> >
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 90e75d5a54d6..b0ac6a717e44 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -111,6 +111,21 @@ static inline int migrate_misplaced_page(struct page *page,
> >  }
> >  #endif /* CONFIG_NUMA_BALANCING */
> >
> > +#define DEFAULT_DEMOTION_TARGET_NODES 15
> > +
> > +#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
> > +#define DEMOTION_TARGET_NODES  (MAX_NUMNODES - 1)
> > +#else
> > +#define DEMOTION_TARGET_NODES  DEFAULT_DEMOTION_TARGET_NODES
> > +#endif
> > +
> > +struct demotion_nodes {
> > +       unsigned short nr;
> > +       short nodes[DEMOTION_TARGET_NODES];
> > +};
> > +
> > +extern struct demotion_nodes *node_demotion __read_mostly;
> > +
> >  #ifdef CONFIG_MIGRATION
> >
> >  /*
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 6c31ee1e1c9b..e47ea25fcfe8 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2172,20 +2172,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
> >   * must be held over all reads to ensure that no cycles are
> >   * observed.
> >   */
> > -#define DEFAULT_DEMOTION_TARGET_NODES 15
> > -
> > -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
> > -#define DEMOTION_TARGET_NODES  (MAX_NUMNODES - 1)
> > -#else
> > -#define DEMOTION_TARGET_NODES  DEFAULT_DEMOTION_TARGET_NODES
> > -#endif
> > -
> > -struct demotion_nodes {
> > -       unsigned short nr;
> > -       short nodes[DEMOTION_TARGET_NODES];
> > -};
> > -
> > -static struct demotion_nodes *node_demotion __read_mostly;
> > +struct demotion_nodes *node_demotion __read_mostly;
> >
> >  /**
> >   * next_demotion_node() - Get the next node in the demotion path
> > --
> > 2.26.2
> >
> >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf
  2022-04-17  3:49 ` [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf Davidlohr Bueso
  2022-04-18 15:34   ` Dave Hansen
@ 2022-04-22 17:37   ` Yang Shi
  1 sibling, 0 replies; 25+ messages in thread
From: Yang Shi @ 2022-04-22 17:37 UTC (permalink / raw)
  To: linux-mm, mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, linux-kernel, jvgediya, ying.huang

On Sat, Apr 16, 2022 at 8:49 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
>
>
> This allows userspace to know if the node is considered fast
> memory (with CPUs attached to it). While this can be already
> derived without a new file, this helps further encapsulate the
> concept.
>
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
> Resending, just noticed this patch was never posted.
>
>   Documentation/ABI/stable/sysfs-devices-node |  6 ++++++
>   drivers/base/node.c                         | 13 +++++++++++++
>   2 files changed, 19 insertions(+)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index f620c6ae013c..1c21c3985535 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -198,3 +198,9 @@ Date:               April 2022
>   Contact:      Davidlohr Bueso <dave@stgolabs.net>
>   Description:
>                 Shows nodes within the next tier of slower memory below this node.
> +
> +What:          /sys/devices/system/node/nodeX/memory_toptier
> +Date:          April 2022
> +Contact:       Davidlohr Bueso <dave@stgolabs.net>
> +Description:
> +               Node is attached to fast memory or not.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index ab4bae777535..b9de5b0360f2 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -598,12 +598,25 @@ static ssize_t node_read_demotion_path(struct device *dev,
>   }
>   static DEVICE_ATTR(demotion_path, 0444, node_read_demotion_path, NULL);
>
> +static ssize_t node_read_memory_toptier(struct device *dev,
> +                                    struct device_attribute *attr, char *buf)
> +{
> +       int nid = dev->id;
> +       int len = 0;
> +
> +       len += sysfs_emit_at(buf, len, "%d\n", !!node_is_toptier(nid));

That is not guaranteed. Some hardware configurations have CPU-less DRAM
nodes, which should still be treated as top-tier nodes IMHO. Please see
https://lore.kernel.org/linux-mm/20220413092206.73974-1-jvgediya@linux.ibm.com/

> +
> +       return len;
> +}
> +static DEVICE_ATTR(memory_toptier, 0444, node_read_memory_toptier, NULL);
> +
>   static struct attribute *node_dev_attrs[] = {
>         &dev_attr_meminfo.attr,
>         &dev_attr_numastat.attr,
>         &dev_attr_distance.attr,
>         &dev_attr_vmstat.attr,
>         &dev_attr_demotion_path.attr,
> +       &dev_attr_memory_toptier.attr,
>         NULL
>   };
>
> --
> 2.26.2
>
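
The quoted changelog notes that the information "can be already derived without a new file". One way to do that from userspace, sketched under the assumption that the per-node cpulist file is available, is to check whether it lists any CPU. The helper names cpulist_has_cpus() and node_has_cpus() are hypothetical. As pointed out above, "has CPUs" is only a proxy: a CPU-less DRAM node would be reported as 0 here even though it may well belong to the top tier.

```c
#include <ctype.h>
#include <stdio.h>

/* Returns 1 if a cpulist string such as "0-3,8\n" names at least one CPU. */
static int cpulist_has_cpus(const char *s)
{
	for (; *s; s++)
		if (isdigit((unsigned char)*s))
			return 1;
	return 0;
}

/*
 * Returns 1 if the node lists CPUs, 0 if it lists none, and -1 if the
 * file cannot be opened (for example, the node does not exist).
 */
static int node_has_cpus(int nid)
{
	char path[64], buf[4096];
	FILE *f;
	int ret;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/cpulist", nid);
	f = fopen(path, "r");
	if (!f)
		return -1;

	/* An empty file or a bare newline means a CPU-less node. */
	ret = fgets(buf, sizeof(buf), f) ? cpulist_has_cpus(buf) : 0;
	fclose(f);
	return ret;
}
```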

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/6] mm/migration: export demotion_path of a node via sysfs
  2022-04-22 17:33     ` Yang Shi
@ 2022-04-22 17:50       ` Davidlohr Bueso
  0 siblings, 0 replies; 25+ messages in thread
From: Davidlohr Bueso @ 2022-04-22 17:50 UTC (permalink / raw)
  To: Yang Shi
  Cc: jvgediya, ying.huang, weixugc, linux-mm, mhocko, akpm, rientjes,
	yosryahmed, hannes, shakeelb, dave.hansen, tim.c.chen,
	roman.gushchin, gthelen, a.manzanares, heekwon.p, gim.jongmin,
	linux-kernel

On Fri, 22 Apr 2022, Yang Shi wrote:

>On Fri, Apr 22, 2022 at 10:31 AM Yang Shi <shy828301@gmail.com> wrote:
>>
>> On Fri, Apr 15, 2022 at 10:39 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
>> >
>> > Add a /sys/devices/system/node/nodeX/demotion_path file
>> > to export the possible target(s) in node_demotion[node].
>>
>> I'm not sure if you noticed that Jagdish Gediya is working on a
>> similar patch; please see
>> https://lore.kernel.org/linux-mm/20220413092206.73974-1-jvgediya@linux.ibm.com/
>
>Loop in Jagdish Gediya, Ying Huang and Wei Xu.
>

Hmm I had missed this thread, I'll go have a look.

>>
>> It would be better to combine the two to avoid duplicate effort.

Indeed - and even more reason for lsfmm discussions defining the
future ABI for tiering.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/6] drivers/base/node: cleanup register_node()
  2022-04-16  5:38 ` [PATCH 1/6] drivers/base/node: cleanup register_node() Davidlohr Bueso
@ 2022-04-25 22:30   ` Adam Manzanares
  2022-05-03 18:17   ` David Hildenbrand
  2022-05-04  4:33   ` David Rientjes
  2 siblings, 0 replies; 25+ messages in thread
From: Adam Manzanares @ 2022-04-25 22:30 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-mm, mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, Heekwon Park,
	Jongmin Gim, linux-kernel

On Fri, Apr 15, 2022 at 10:38:57PM -0700, Davidlohr Bueso wrote:
> Trivially get rid of some unnecessary indentation.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
>  drivers/base/node.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index ec8bb24a5a22..6cdf25fd26c3 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -661,16 +661,16 @@ static int register_node(struct node *node, int num)
>  	node->dev.bus = &node_subsys;
>  	node->dev.release = node_device_release;
>  	node->dev.groups = node_dev_groups;
> -	error = device_register(&node->dev);
>  
> -	if (error)
> +	error = device_register(&node->dev);
> +	if (error) {
>  		put_device(&node->dev);
> -	else {
> -		hugetlb_register_node(node);
> -
> -		compaction_register_node(node);
> +		return error;
>  	}
> -	return error;
> +
> +	hugetlb_register_node(node);
> +	compaction_register_node(node);
> +	return 0;
>  }
>  
>  /**
> -- 
> 2.26.2
>


Looks good.

Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim
  2022-04-16  5:38 ` [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim Davidlohr Bueso
@ 2022-04-25 22:32   ` Adam Manzanares
  2022-05-04  4:33   ` David Rientjes
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: Adam Manzanares @ 2022-04-25 22:32 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-mm, mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, Heekwon Park,
	Jongmin Gim, linux-kernel

On Fri, Apr 15, 2022 at 10:38:58PM -0700, Davidlohr Bueso wrote:
> We have helpers for a reason.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1678802e03e7..cb583fcbf5bf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4750,7 +4750,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  	 * over remote processors and spread off node memory allocations
>  	 * as wide as possible.
>  	 */
> -	if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
> +	if (node_is_toptier(pgdat->node_id) && pgdat->node_id != numa_node_id())
>  		return NODE_RECLAIM_NOSCAN;
>  
>  	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
> -- 
> 2.26.2
>


Looks good.

Reviewed-by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/6] drivers/base/node: cleanup register_node()
  2022-04-16  5:38 ` [PATCH 1/6] drivers/base/node: cleanup register_node() Davidlohr Bueso
  2022-04-25 22:30   ` Adam Manzanares
@ 2022-05-03 18:17   ` David Hildenbrand
  2022-05-04  4:33   ` David Rientjes
  2 siblings, 0 replies; 25+ messages in thread
From: David Hildenbrand @ 2022-05-03 18:17 UTC (permalink / raw)
  To: Davidlohr Bueso, linux-mm
  Cc: mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, linux-kernel

On 16.04.22 07:38, Davidlohr Bueso wrote:
> Trivially get rid of some unnecessary indentation.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
>  drivers/base/node.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index ec8bb24a5a22..6cdf25fd26c3 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -661,16 +661,16 @@ static int register_node(struct node *node, int num)
>  	node->dev.bus = &node_subsys;
>  	node->dev.release = node_device_release;
>  	node->dev.groups = node_dev_groups;
> -	error = device_register(&node->dev);
>  
> -	if (error)
> +	error = device_register(&node->dev);
> +	if (error) {
>  		put_device(&node->dev);
> -	else {
> -		hugetlb_register_node(node);
> -
> -		compaction_register_node(node);
> +		return error;
>  	}
> -	return error;
> +
> +	hugetlb_register_node(node);
> +	compaction_register_node(node);
> +	return 0;
>  }
>  
>  /**

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/6] drivers/base/node: cleanup register_node()
  2022-04-16  5:38 ` [PATCH 1/6] drivers/base/node: cleanup register_node() Davidlohr Bueso
  2022-04-25 22:30   ` Adam Manzanares
  2022-05-03 18:17   ` David Hildenbrand
@ 2022-05-04  4:33   ` David Rientjes
  2 siblings, 0 replies; 25+ messages in thread
From: David Rientjes @ 2022-05-04  4:33 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-mm, mhocko, akpm, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, linux-kernel

On Fri, 15 Apr 2022, Davidlohr Bueso wrote:

> Trivially get rid of some unnecessary indentation.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim
  2022-04-16  5:38 ` [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim Davidlohr Bueso
  2022-04-25 22:32   ` Adam Manzanares
@ 2022-05-04  4:33   ` David Rientjes
  2022-05-04  7:26   ` Jagdish Gediya
  2022-05-31 11:50   ` Aneesh Kumar K.V
  3 siblings, 0 replies; 25+ messages in thread
From: David Rientjes @ 2022-05-04  4:33 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-mm, mhocko, akpm, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, linux-kernel

On Fri, 15 Apr 2022, Davidlohr Bueso wrote:

> We have helpers for a reason.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim
  2022-04-16  5:38 ` [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim Davidlohr Bueso
  2022-04-25 22:32   ` Adam Manzanares
  2022-05-04  4:33   ` David Rientjes
@ 2022-05-04  7:26   ` Jagdish Gediya
  2022-05-31 11:50   ` Aneesh Kumar K.V
  3 siblings, 0 replies; 25+ messages in thread
From: Jagdish Gediya @ 2022-05-04  7:26 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-mm, mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, linux-kernel

On Fri, Apr 15, 2022 at 10:38:58PM -0700, Davidlohr Bueso wrote:
> We have helpers for a reason.
> 
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1678802e03e7..cb583fcbf5bf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4750,7 +4750,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  	 * over remote processors and spread off node memory allocations
>  	 * as wide as possible.
>  	 */
> -	if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
> +	if (node_is_toptier(pgdat->node_id) && pgdat->node_id != numa_node_id())
Currently node_is_toptier() treats every N_CPU node as top tier, but an
N_CPU node will not necessarily remain top tier, as per the discussion
in the thread below.

https://lore.kernel.org/linux-mm/CAAPL-u9sVx94ACSuCVN8V0tKp+AMxiY89cro0japtyB=xNfNBw@mail.gmail.com/

The definition of node_is_toptier() may change based on the outcome of
that thread.
>  		return NODE_RECLAIM_NOSCAN;
>  
>  	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
> -- 
> 2.26.2
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim
  2022-04-16  5:38 ` [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim Davidlohr Bueso
                     ` (2 preceding siblings ...)
  2022-05-04  7:26   ` Jagdish Gediya
@ 2022-05-31 11:50   ` Aneesh Kumar K.V
  2022-06-01  6:12     ` Ying Huang
  3 siblings, 1 reply; 25+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-31 11:50 UTC (permalink / raw)
  To: Davidlohr Bueso, linux-mm, Wei Xu, Huang Ying
  Cc: mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, dave, linux-kernel

Davidlohr Bueso <dave@stgolabs.net> writes:

> We have helpers for a reason.
>
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1678802e03e7..cb583fcbf5bf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4750,7 +4750,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  	 * over remote processors and spread off node memory allocations
>  	 * as wide as possible.
>  	 */
> -	if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
> +	if (node_is_toptier(pgdat->node_id) && pgdat->node_id != numa_node_id())
>  		return NODE_RECLAIM_NOSCAN;
>  
>  	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))


Are we really looking at the top tier in a tiered memory hierarchy here?
The comment seems to suggest we are looking at the local NUMA node?


-aneesh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim
  2022-05-31 11:50   ` Aneesh Kumar K.V
@ 2022-06-01  6:12     ` Ying Huang
  2022-06-01 14:00       ` Davidlohr Bueso
  0 siblings, 1 reply; 25+ messages in thread
From: Ying Huang @ 2022-06-01  6:12 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Davidlohr Bueso, linux-mm, Wei Xu
  Cc: mhocko, akpm, rientjes, yosryahmed, hannes, shakeelb,
	dave.hansen, tim.c.chen, roman.gushchin, gthelen, a.manzanares,
	heekwon.p, gim.jongmin, linux-kernel

On Tue, 2022-05-31 at 17:20 +0530, Aneesh Kumar K.V wrote:
> Davidlohr Bueso <dave@stgolabs.net> writes:
> 
> > We have helpers for a reason.
> > 
> > Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> > ---
> >  mm/vmscan.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 1678802e03e7..cb583fcbf5bf 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4750,7 +4750,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
> >  	 * over remote processors and spread off node memory allocations
> >  	 * as wide as possible.
> >  	 */
> > -	if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
> > +	if (node_is_toptier(pgdat->node_id) && pgdat->node_id != numa_node_id())
> >  		return NODE_RECLAIM_NOSCAN;
> >  
> > 
> >  	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
> 
> 
> Are we really looking at the top tier in a tiered memory hierarchy here?
> The comment seems to suggest we are looking at local NUMA node?

The code change itself is correct.  But it is an implementation detail
that node_is_toptier() == node_state(, N_CPU).  Once we support more
memory tiers (like GPU, HBM), we will change the implementation of
node_is_toptier().  So I think it's better to keep the original code.

Best Regards,
Huang, Ying
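
The point above can be illustrated with a small userspace model. This is
only a sketch under stated assumptions, not kernel code: the struct, the
four-node topology, the tier ranking and every model_* name are
hypothetical and exist only for illustration. It shows that the two
predicates agree today but diverge once a CPU-less node (HBM here) ranks
above DRAM, which changes which remote nodes node_reclaim() would skip.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a NUMA node; not a kernel structure. */
struct model_node {
	bool has_cpu;	/* stands in for node_state(nid, N_CPU) */
	int tier;	/* assumed ranking, 0 = fastest tier */
};

/*
 * Assumed example topology: nodes 0 and 1 are DRAM with CPUs,
 * node 2 is CPU-less HBM, node 3 is CPU-less slow memory.
 */
static const struct model_node model_nodes[] = {
	{ .has_cpu = true,  .tier = 1 },
	{ .has_cpu = true,  .tier = 1 },
	{ .has_cpu = false, .tier = 0 },
	{ .has_cpu = false, .tier = 2 },
};

/* Top tier as described in the thread: equivalent to "node has CPUs". */
static bool model_toptier_current(int nid)
{
	return model_nodes[nid].has_cpu;
}

/* Hypothetical future definition: top tier is the fastest tier. */
static bool model_toptier_tiered(int nid)
{
	return model_nodes[nid].tier == 0;
}

/*
 * Same shape as the early return in node_reclaim(): do not scan a
 * node that satisfies the predicate and is not the local node.
 */
static bool model_skip_reclaim(bool (*pred)(int), int nid, int local_nid)
{
	return pred(nid) && nid != local_nid;
}
```

With node 0 as the local node, the current definition skips remote node 1
(it has CPUs) and scans node 2. Under the hypothetical tiered definition
the result flips: node 1 would be scanned and node 2 skipped. That is why
keeping the explicit N_CPU test preserves the behaviour the comment
describes.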




* Re: [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim
  2022-06-01  6:12     ` Ying Huang
@ 2022-06-01 14:00       ` Davidlohr Bueso
  0 siblings, 0 replies; 25+ messages in thread
From: Davidlohr Bueso @ 2022-06-01 14:00 UTC (permalink / raw)
  To: Ying Huang
  Cc: Aneesh Kumar K.V, linux-mm, Wei Xu, mhocko, akpm, rientjes,
	yosryahmed, hannes, shakeelb, dave.hansen, tim.c.chen,
	roman.gushchin, gthelen, a.manzanares, heekwon.p, gim.jongmin,
	linux-kernel

On Wed, 01 Jun 2022, Ying Huang wrote:

>On Tue, 2022-05-31 at 17:20 +0530, Aneesh Kumar K.V wrote:
>> Davidlohr Bueso <dave@stgolabs.net> writes:
>>
>> > We have helpers for a reason.
>> >
>> > Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
>> > ---
>> >  mm/vmscan.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 1678802e03e7..cb583fcbf5bf 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -4750,7 +4750,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>> >  	 * over remote processors and spread off node memory allocations
>> >  	 * as wide as possible.
>> >  	 */
>> > -	if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
>> > +	if (node_is_toptier(pgdat->node_id) && pgdat->node_id != numa_node_id())
>> >  		return NODE_RECLAIM_NOSCAN;
>> >
>> >
>> >  	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
>>
>>
>> Are we really looking at the top tier in a tiered memory hierarchy here?
>> The comment seems to suggest we are looking at local NUMA node?
>
>The code change itself is correct.  But it is an implementation detail
>that node_is_toptier() == node_state(, N_CPU).  Once we support more
>memory tiers (like GPU, HBM), we will change the implementation of
>node_is_toptier().  So I think it's better to keep the original code.

Agreed.

Thanks,
Davidlohr


Thread overview: 25+ messages
2022-04-16  5:38 [PATCH RFC lsfmm 0/6] mm: proactive reclaim and memory tiering topics Davidlohr Bueso
2022-04-16  5:38 ` [PATCH 1/6] drivers/base/node: cleanup register_node() Davidlohr Bueso
2022-04-25 22:30   ` Adam Manzanares
2022-05-03 18:17   ` David Hildenbrand
2022-05-04  4:33   ` David Rientjes
2022-04-16  5:38 ` [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim Davidlohr Bueso
2022-04-25 22:32   ` Adam Manzanares
2022-05-04  4:33   ` David Rientjes
2022-05-04  7:26   ` Jagdish Gediya
2022-05-31 11:50   ` Aneesh Kumar K.V
2022-06-01  6:12     ` Ying Huang
2022-06-01 14:00       ` Davidlohr Bueso
2022-04-16  5:38 ` [PATCH 3/6] mm: make __node_reclaim() more flexible Davidlohr Bueso
2022-04-16  5:39 ` [PATCH 4/6] mm: introduce per-node proactive reclaim interface Davidlohr Bueso
2022-04-19  0:00   ` Tim Chen
2022-04-16  5:39 ` [PATCH 5/6] mm/migration: export demotion_path of a node via sysfs Davidlohr Bueso
2022-04-22 17:31   ` Yang Shi
2022-04-22 17:33     ` Yang Shi
2022-04-22 17:50       ` Davidlohr Bueso
2022-04-17  3:49 ` [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf Davidlohr Bueso
2022-04-18 15:34   ` Dave Hansen
2022-04-18 16:45     ` Davidlohr Bueso
2022-04-18 16:50       ` Dave Hansen
2022-04-18 17:01         ` Davidlohr Bueso
2022-04-22 17:37   ` Yang Shi
