linux-mm.kvack.org archive mirror
* [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard
@ 2021-04-01 18:32 Dave Hansen
  2021-04-01 18:32 ` [PATCH 01/10] mm/numa: node demotion data structure and lookup Dave Hansen
                   ` (10 more replies)
  0 siblings, 11 replies; 64+ messages in thread
From: Dave Hansen @ 2021-04-01 18:32 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Dave Hansen, yang.shi, rientjes, ying.huang,
	dan.j.williams, david, osalvador, weixugc

I'm resending this because I forgot to cc the mailing lists on the
post yesterday.  Sorry for the noise.  Please reply to this series.

The full series is also available here:

	https://github.com/hansendc/linux/tree/automigrate-20210331

which also includes some vm.zone_reclaim_mode sysctl ABI fixup
prerequisites:

	https://github.com/hansendc/linux/commit/18daad8f0181a2da57cb43e595303c2ef5bd7b6e
	https://github.com/hansendc/linux/commit/a873f3b6f250581072ab36f2735a3aa341e36705

There are no major changes since the last post.

--

We're starting to see systems with more and more kinds of memory such
as Intel's implementation of persistent memory.

Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out.  Allocations will, at some point, start
falling over to the slower persistent memory.

That has two nasty properties.  First, the newer allocations can end
up in the slower persistent memory.  Second, reclaimed data in DRAM
are just discarded even if there are gobs of space in persistent
memory that could be used.

This set implements a solution to these problems.  At the end of the
reclaim process in shrink_page_list() just before the last page
refcount is dropped, the page is migrated to persistent memory instead
of being dropped.
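
In rough pseudo-code, the decision at that point looks like this
(illustrative only: next_demotion_node() is introduced in patch 01,
and the other helper names here are made up):

	int target_nid = next_demotion_node(page_to_nid(page));

	/* reclaim_page_as_usual()/demote_page_to_node() are made-up names */
	if (target_nid == NUMA_NO_NODE)
		reclaim_page_as_usual(page);		/* today's behavior */
	else
		demote_page_to_node(page, target_nid);	/* new behavior */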

While I've talked about a DRAM/PMEM pairing, this approach would
function in any environment where memory tiers exist.

This is not perfect.  It "strands" pages in slower memory and never
brings them back to fast DRAM.  Huang Ying has follow-on work which
repurposes autonuma to promote hot pages back to DRAM.

This is also all based on an upstream mechanism that allows
persistent memory to be onlined and used as if it were volatile:

	http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

== Open Issues ==

 * Memory under policies or cpusets that, for instance, restrict
   allocations to DRAM can still be demoted to PMEM whenever this
   new mechanism is opted in to.  A cgroup-level API to opt in to or
   out of these migrations will likely be required as a follow-on.
 * Could be more aggressive about where anon LRU scanning occurs
   since it no longer necessarily involves I/O.  get_scan_count()
   for instance says: "If we have no swap space, do not bother
   scanning anon pages"

--

 Documentation/admin-guide/sysctl/vm.rst |  12 +
 include/linux/migrate.h                 |  14 +-
 include/linux/swap.h                    |   3 +-
 include/linux/vm_event_item.h           |   2 +
 include/trace/events/migrate.h          |   3 +-
 include/uapi/linux/mempolicy.h          |   1 +
 mm/compaction.c                         |   3 +-
 mm/gup.c                                |   3 +-
 mm/internal.h                           |   5 +
 mm/memory-failure.c                     |   4 +-
 mm/memory_hotplug.c                     |   4 +-
 mm/mempolicy.c                          |   8 +-
 mm/migrate.c                            | 315 +++++++++++++++++++++++-
 mm/page_alloc.c                         |  11 +-
 mm/vmscan.c                             | 158 +++++++++++-
 mm/vmstat.c                             |   2 +
 16 files changed, 520 insertions(+), 28 deletions(-)

--

Changes since (automigrate-20210304):
 * Add ack/review tags
 * Remove duplicate synchronize_rcu() call

Changes since (automigrate-20210122):
 * move from GFP_HIGHUSER -> GFP_HIGHUSER_MOVABLE since pages *are*
   movable.
 * Separate out helpers that check for being able to reclaim anonymous
   pages versus being able to meaningfully scan the anon LRU.

Changes since (automigrate-20200818):
 * Fall back to normal reclaim when demotion fails
 * Fix some compile issues, when page migration and NUMA are off

Changes since (automigrate-20201007):
 * separate out checks for "can scan anon LRU" from "can actually
   swap anon pages right now".  Previous series conflated them
   and may have been overly aggressive scanning LRU
 * add MR_DEMOTION to tracepoint header
 * remove unnecessary hugetlb page check

Changes since (https://lwn.net/Articles/824830/):
 * Use higher-level migrate_pages() API approach from Yang Shi's
   earlier patches.
 * made sure to actually check node_reclaim_mode's new bit
 * disabled migration entirely before introducing RECLAIM_MIGRATE
 * Replace GFP_NOWAIT with explicit __GFP_KSWAPD_RECLAIM and
   comment why we want that.
 * Comment on the effects that keep multiple source nodes from
   sharing target nodes

Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>
Cc: Wei Xu <weixugc@google.com>




* [PATCH 01/10] mm/numa: node demotion data structure and lookup
  2021-04-01 18:32 [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Dave Hansen
@ 2021-04-01 18:32 ` Dave Hansen
  2021-04-08  8:03   ` Oscar Salvador
  2021-04-09  5:32   ` Wei Xu
  2021-04-01 18:32 ` [PATCH 02/10] mm/numa: automatically generate node migration order Dave Hansen
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 64+ messages in thread
From: Dave Hansen @ 2021-04-01 18:32 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Dave Hansen, shy828301, weixugc, rientjes,
	ying.huang, dan.j.williams, david, osalvador


From: Dave Hansen <dave.hansen@linux.intel.com>

Prepare for the kernel to auto-migrate pages to other memory nodes
with a user-defined node migration table.  This allows creating a single
migration target for each NUMA node to enable the kernel to do NUMA
page migrations instead of simply reclaiming colder pages. A node
with no target is a "terminal node", so reclaim acts normally there.
The migration target does not fundamentally _need_ to be a single node,
but this implementation starts there to limit complexity.

If you consider the migration path as a graph, cycles (loops) in the
graph are disallowed.  This avoids wasting resources by constantly
migrating (A->B, B->A, A->B ...).  The expectation is that cycles will
never be allowed.
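
As a usage sketch (hypothetical caller, not part of this patch), the
terminal node for a given starting node 'nid' can be found by walking
the table:

	int terminal = nid;
	int next;

	while ((next = next_demotion_node(terminal)) != NUMA_NO_NODE)
		terminal = next;
	/* 'terminal' is now the last node in the demotion path */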

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>

--

changes since 20200122:
 * Make node_demotion[] __read_mostly

changes in July 2020:
 - Remove loop from next_demotion_node() and get_online_mems().
   This means that the node returned by next_demotion_node()
   might now be offline, but the worst case is that the
   allocation fails.  That's fine since it is transient.
---

 b/mm/migrate.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff -puN mm/migrate.c~0006-node-Define-and-export-memory-migration-path mm/migrate.c
--- a/mm/migrate.c~0006-node-Define-and-export-memory-migration-path	2021-03-31 15:17:10.734000264 -0700
+++ b/mm/migrate.c	2021-03-31 15:17:10.742000264 -0700
@@ -1163,6 +1163,23 @@ out:
 	return rc;
 }
 
+static int node_demotion[MAX_NUMNODES] __read_mostly =
+	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
+
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * @returns: node id for next memory node in the demotion path hierarchy
+ * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
+ * @node online or guarantee that it *continues* to be the next demotion
+ * target.
+ */
+int next_demotion_node(int node)
+{
+	return node_demotion[node];
+}
+
 /*
  * Obtain the lock on page, remove all ptes and migrate the page
  * to the newly allocated page in newpage.
_



* [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-04-01 18:32 [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Dave Hansen
  2021-04-01 18:32 ` [PATCH 01/10] mm/numa: node demotion data structure and lookup Dave Hansen
@ 2021-04-01 18:32 ` Dave Hansen
  2021-04-08  8:26   ` Oscar Salvador
  2021-04-10  3:07   ` Wei Xu
  2021-04-01 18:32 ` [PATCH 03/10] mm/migrate: update node demotion order during hotplug events Dave Hansen
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 64+ messages in thread
From: Dave Hansen @ 2021-04-01 18:32 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Dave Hansen, shy828301, weixugc, rientjes,
	ying.huang, dan.j.williams, david, osalvador


From: Dave Hansen <dave.hansen@linux.intel.com>

When memory fills up on a node, memory contents can be
automatically migrated to another node.  The biggest problems are
knowing when to migrate and to where the migration should be
targeted.

The most straightforward way to generate the "to where" list would
be to follow the page allocator fallback lists.  Those lists
already tell us where to look next when memory is full.  It would
also be logical to move memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes
appear in all the lists.  This would potentially lead to migration
cycles (A->B, B->A, A->B, ...).

Instead of using the allocator fallback lists directly, keep a
separate node migration ordering.  But, reuse the same data used
to generate page allocator fallback in the first place:
find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now.  It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().

The protocol for node_demotion[] access and writing is not
standard.  It has no specific locking and is intended to be read
locklessly.  Readers must take care to avoid observing changes
that appear incoherent.  This was done so that node_demotion[]
locking has no chance of becoming a bottleneck on large systems
with lots of CPUs in direct reclaim.

This code is unused for now.  It will be called later in the
series.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>

--

Changes from 20200122:
 * Add big node_demotion[] comment
Changes from 20210302:
 * Fix typo in node_demotion[] comment
---

 b/mm/internal.h   |    5 +
 b/mm/migrate.c    |  175 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 b/mm/page_alloc.c |    2 
 3 files changed, 180 insertions(+), 2 deletions(-)

diff -puN mm/internal.h~auto-setup-default-migration-path-from-firmware mm/internal.h
--- a/mm/internal.h~auto-setup-default-migration-path-from-firmware	2021-03-31 15:17:11.794000261 -0700
+++ b/mm/internal.h	2021-03-31 15:17:11.816000261 -0700
@@ -515,12 +515,17 @@ static inline void mminit_validate_memmo
 
 #ifdef CONFIG_NUMA
 extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
+extern int find_next_best_node(int node, nodemask_t *used_node_mask);
 #else
 static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
 				unsigned int order)
 {
 	return NODE_RECLAIM_NOSCAN;
 }
+static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
+{
+	return NUMA_NO_NODE;
+}
 #endif
 
 extern int hwpoison_filter(struct page *p);
diff -puN mm/migrate.c~auto-setup-default-migration-path-from-firmware mm/migrate.c
--- a/mm/migrate.c~auto-setup-default-migration-path-from-firmware	2021-03-31 15:17:11.798000261 -0700
+++ b/mm/migrate.c	2021-03-31 15:17:11.821000261 -0700
@@ -1163,6 +1163,44 @@ out:
 	return rc;
 }
 
+
+/*
+ * node_demotion[] example:
+ *
+ * Consider a system with two sockets.  Each socket has
+ * three classes of memory attached: fast, medium and slow.
+ * Each memory class is placed in its own NUMA node.  The
+ * CPUs are placed in the node with the "fast" memory.  The
+ * 6 NUMA nodes (0-5) might be split among the sockets like
+ * this:
+ *
+ *	Socket A: 0, 1, 2
+ *	Socket B: 3, 4, 5
+ *
+ * When Node 0 fills up, its memory should be migrated to
+ * Node 1.  When Node 1 fills up, it should be migrated to
+ * Node 2.  The migration path starts on the nodes with the
+ * processors (since allocations default to this node) and
+ * fast memory, progresses through medium and ends with the
+ * slow memory:
+ *
+ *	0 -> 1 -> 2 -> stop
+ *	3 -> 4 -> 5 -> stop
+ *
+ * This is represented in the node_demotion[] like this:
+ *
+ *	{  1, // Node 0 migrates to 1
+ *	   2, // Node 1 migrates to 2
+ *	  -1, // Node 2 does not migrate
+ *	   4, // Node 3 migrates to 4
+ *	   5, // Node 4 migrates to 5
+ *	  -1} // Node 5 does not migrate
+ */
+
+/*
+ * Writes to this array occur without locking.  READ_ONCE()
+ * is recommended for readers to ensure consistent reads.
+ */
 static int node_demotion[MAX_NUMNODES] __read_mostly =
 	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
 
@@ -1177,7 +1215,13 @@ static int node_demotion[MAX_NUMNODES] _
  */
 int next_demotion_node(int node)
 {
-	return node_demotion[node];
+	/*
+	 * node_demotion[] is updated without excluding
+	 * this function from running.  READ_ONCE() avoids
+	 * reading multiple, inconsistent 'node' values
+	 * during an update.
+	 */
+	return READ_ONCE(node_demotion[node]);
 }
 
 /*
@@ -3181,3 +3225,132 @@ void migrate_vma_finalize(struct migrate
 }
 EXPORT_SYMBOL(migrate_vma_finalize);
 #endif /* CONFIG_DEVICE_PRIVATE */
+
+/* Disable reclaim-based migration. */
+static void disable_all_migrate_targets(void)
+{
+	int node;
+
+	for_each_online_node(node)
+		node_demotion[node] = NUMA_NO_NODE;
+}
+
+/*
+ * Find an automatic demotion target for 'node'.
+ * Failing here is OK.  It might just indicate
+ * being at the end of a chain.
+ */
+static int establish_migrate_target(int node, nodemask_t *used)
+{
+	int migration_target;
+
+	/*
+	 * Can not set a migration target on a
+	 * node with it already set.
+	 *
+	 * No need for READ_ONCE() here since this
+	 * is in the write path for node_demotion[].
+	 * This should be the only thread writing.
+	 */
+	if (node_demotion[node] != NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	migration_target = find_next_best_node(node, used);
+	if (migration_target == NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	node_demotion[node] = migration_target;
+
+	return migration_target;
+}
+
+/*
+ * When memory fills up on a node, memory contents can be
+ * automatically migrated to another node instead of
+ * discarded at reclaim.
+ *
+ * Establish a "migration path" which will start at nodes
+ * with CPUs and will follow the priorities used to build the
+ * page allocator zonelists.
+ *
+ * The difference here is that cycles must be avoided.  If
+ * node0 migrates to node1, then neither node1, nor anything
+ * node1 migrates to can migrate to node0.
+ *
+ * This function can run simultaneously with readers of
+ * node_demotion[].  However, it can not run simultaneously
+ * with itself.  Exclusion is provided by memory hotplug events
+ * being single-threaded.
+ */
+static void __set_migration_target_nodes(void)
+{
+	nodemask_t next_pass	= NODE_MASK_NONE;
+	nodemask_t this_pass	= NODE_MASK_NONE;
+	nodemask_t used_targets = NODE_MASK_NONE;
+	int node;
+
+	/*
+	 * Avoid any oddities like cycles that could occur
+	 * from changes in the topology.  This will leave
+	 * a momentary gap when migration is disabled.
+	 */
+	disable_all_migrate_targets();
+
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 *
+	 * The before+after state together might have cycles and
+	 * could cause readers to do things like loop until this
+	 * function finishes.  This ensures they can only see a
+	 * single "bad" read and would, for instance, only loop
+	 * once.
+	 */
+	smp_wmb();
+
+	/*
+	 * Allocations go close to CPUs, first.  Assume that
+	 * the migration path starts at the nodes with CPUs.
+	 */
+	next_pass = node_states[N_CPU];
+again:
+	this_pass = next_pass;
+	next_pass = NODE_MASK_NONE;
+	/*
+	 * To avoid cycles in the migration "graph", ensure
+	 * that migration sources are not future targets by
+	 * setting them in 'used_targets'.  Do this only
+	 * once per pass so that multiple source nodes can
+	 * share a target node.
+	 *
+	 * 'used_targets' will become unavailable in future
+	 * passes.  This limits some opportunities for
+	 * multiple source nodes to share a destination.
+	 */
+	nodes_or(used_targets, used_targets, this_pass);
+	for_each_node_mask(node, this_pass) {
+		int target_node = establish_migrate_target(node, &used_targets);
+
+		if (target_node == NUMA_NO_NODE)
+			continue;
+
+		/* Visit targets from this pass in the next pass: */
+		node_set(target_node, next_pass);
+	}
+	/* Is another pass necessary? */
+	if (!nodes_empty(next_pass))
+		goto again;
+}
+
+/*
+ * For callers that do not hold get_online_mems() already.
+ */
+__maybe_unused // <- temporary to prevent warnings during bisects
+static void set_migration_target_nodes(void)
+{
+	get_online_mems();
+	__set_migration_target_nodes();
+	put_online_mems();
+}
diff -puN mm/page_alloc.c~auto-setup-default-migration-path-from-firmware mm/page_alloc.c
--- a/mm/page_alloc.c~auto-setup-default-migration-path-from-firmware	2021-03-31 15:17:11.811000261 -0700
+++ b/mm/page_alloc.c	2021-03-31 15:17:11.826000261 -0700
@@ -5780,7 +5780,7 @@ static int node_load[MAX_NUMNODES];
  *
  * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
  */
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+int find_next_best_node(int node, nodemask_t *used_node_mask)
 {
 	int n, val;
 	int min_val = INT_MAX;
_



* [PATCH 03/10] mm/migrate: update node demotion order during hotplug events
  2021-04-01 18:32 [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Dave Hansen
  2021-04-01 18:32 ` [PATCH 01/10] mm/numa: node demotion data structure and lookup Dave Hansen
  2021-04-01 18:32 ` [PATCH 02/10] mm/numa: automatically generate node migration order Dave Hansen
@ 2021-04-01 18:32 ` Dave Hansen
  2021-04-08  9:52   ` Oscar Salvador
  2021-04-01 18:32 ` [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded Dave Hansen
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 64+ messages in thread
From: Dave Hansen @ 2021-04-01 18:32 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Dave Hansen, shy828301, weixugc, rientjes,
	ying.huang, dan.j.williams, david, osalvador


From: Dave Hansen <dave.hansen@linux.intel.com>

Reclaim-based migration is attempting to optimize data placement in
memory based on the system topology.  If the system changes, so must
the migration ordering.

The implementation is conceptually simple and entirely unoptimized.
On any memory or CPU hotplug events, assume that a node was added or
removed and recalculate all migration targets.  This ensures that the
node_demotion[] array is always ready to be used in case the new
reclaim mode is enabled.

This recalculation is far from optimal, most glaringly in that it
does not even attempt to figure out whether the hotplug event would
have any *actual* effect on the demotion order.  But, given the expected
paucity of hotplug events, this should be fine.

=== What does RCU provide? ===

Imagine a simple loop which walks down the demotion path looking
for the last node:

        terminal_node = start_node;
        while (node_demotion[terminal_node] != NUMA_NO_NODE) {
                terminal_node = node_demotion[terminal_node];
        }

The initial values are:

        node_demotion[0] = 1;
        node_demotion[1] = NUMA_NO_NODE;

and are updated to:

        node_demotion[0] = NUMA_NO_NODE;
        node_demotion[1] = 0;

What guarantees that the loop did not observe:

        node_demotion[0] = 1;
        node_demotion[1] = 0;

and would loop forever?

With RCU, a rcu_read_lock/unlock() can be placed around the
loop.  Since the write side does a synchronize_rcu(), the loop
that observed the old contents is known to be complete after the
synchronize_rcu() has completed.

RCU, combined with disable_all_migrate_targets(), ensures that
the old migration state is not visible by the time
__set_migration_target_nodes() is called.

=== What does READ_ONCE() provide? ===

READ_ONCE() forbids the compiler from merging or reordering
successive reads of node_demotion[].  This ensures that any
updates are *eventually* observed.

Consider the above loop again.  The compiler could theoretically
read the entirety of node_demotion[] into local storage
(registers) and never go back to memory, and *permanently*
observe bad values for node_demotion[].

Note: RCU does not provide any universal compiler-ordering
guarantees:

	https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/
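
Putting the two together, a correct version of the loop above would
look roughly like this (sketch only, not part of this patch):

	int next;

	rcu_read_lock();
	terminal_node = start_node;
	while ((next = READ_ONCE(node_demotion[terminal_node])) != NUMA_NO_NODE)
		terminal_node = next;
	rcu_read_unlock();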

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>

--

Changes since 20210302:
 * remove duplicate synchronize_rcu()
---

 b/mm/migrate.c |  152 ++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 129 insertions(+), 23 deletions(-)

diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
--- a/mm/migrate.c~enable-numa-demotion	2021-03-31 15:17:13.056000258 -0700
+++ b/mm/migrate.c	2021-03-31 15:17:13.062000258 -0700
@@ -49,6 +49,7 @@
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
 #include <linux/oom.h>
+#include <linux/memory.h>
 
 #include <asm/tlbflush.h>
 
@@ -1198,8 +1199,12 @@ out:
  */
 
 /*
- * Writes to this array occur without locking.  READ_ONCE()
- * is recommended for readers to ensure consistent reads.
+ * Writes to this array occur without locking.  Cycles are
+ * not allowed: Node X demotes to Y which demotes to X...
+ *
+ * If multiple reads are performed, a single rcu_read_lock()
+ * must be held over all reads to ensure that no cycles are
+ * observed.
  */
 static int node_demotion[MAX_NUMNODES] __read_mostly =
 	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
@@ -1215,13 +1220,22 @@ static int node_demotion[MAX_NUMNODES] _
  */
 int next_demotion_node(int node)
 {
+	int target;
+
 	/*
-	 * node_demotion[] is updated without excluding
-	 * this function from running.  READ_ONCE() avoids
-	 * reading multiple, inconsistent 'node' values
-	 * during an update.
+	 * node_demotion[] is updated without excluding this
+	 * function from running.  RCU doesn't provide any
+	 * compiler barriers, so the READ_ONCE() is required
+	 * to avoid compiler reordering or read merging.
+	 *
+	 * Make sure to use RCU over entire code blocks if
+	 * node_demotion[] reads need to be consistent.
 	 */
-	return READ_ONCE(node_demotion[node]);
+	rcu_read_lock();
+	target = READ_ONCE(node_demotion[node]);
+	rcu_read_unlock();
+
+	return target;
 }
 
 /*
@@ -3226,8 +3240,9 @@ void migrate_vma_finalize(struct migrate
 EXPORT_SYMBOL(migrate_vma_finalize);
 #endif /* CONFIG_DEVICE_PRIVATE */
 
+#if defined(CONFIG_MEMORY_HOTPLUG)
 /* Disable reclaim-based migration. */
-static void disable_all_migrate_targets(void)
+static void __disable_all_migrate_targets(void)
 {
 	int node;
 
@@ -3235,6 +3250,25 @@ static void disable_all_migrate_targets(
 		node_demotion[node] = NUMA_NO_NODE;
 }
 
+static void disable_all_migrate_targets(void)
+{
+	__disable_all_migrate_targets();
+
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 *
+	 * The before+after state together might have cycles and
+	 * could cause readers to do things like loop until this
+	 * function finishes.  This ensures they can only see a
+	 * single "bad" read and would, for instance, only loop
+	 * once.
+	 */
+	synchronize_rcu();
+}
+
 /*
  * Find an automatic demotion target for 'node'.
  * Failing here is OK.  It might just indicate
@@ -3297,20 +3331,6 @@ static void __set_migration_target_nodes
 	disable_all_migrate_targets();
 
 	/*
-	 * Ensure that the "disable" is visible across the system.
-	 * Readers will see either a combination of before+disable
-	 * state or disable+after.  They will never see before and
-	 * after state together.
-	 *
-	 * The before+after state together might have cycles and
-	 * could cause readers to do things like loop until this
-	 * function finishes.  This ensures they can only see a
-	 * single "bad" read and would, for instance, only loop
-	 * once.
-	 */
-	smp_wmb();
-
-	/*
 	 * Allocations go close to CPUs, first.  Assume that
 	 * the migration path starts at the nodes with CPUs.
 	 */
@@ -3347,10 +3367,96 @@ again:
 /*
  * For callers that do not hold get_online_mems() already.
  */
-__maybe_unused // <- temporary to prevent warnings during bisects
 static void set_migration_target_nodes(void)
 {
 	get_online_mems();
 	__set_migration_target_nodes();
 	put_online_mems();
 }
+
+/*
+ * React to hotplug events that might affect the migration targets
+ * like events that online or offline NUMA nodes.
+ *
+ * The ordering is also currently dependent on which nodes have
+ * CPUs.  That means we need CPU on/offline notification too.
+ */
+static int migration_online_cpu(unsigned int cpu)
+{
+	set_migration_target_nodes();
+	return 0;
+}
+
+static int migration_offline_cpu(unsigned int cpu)
+{
+	set_migration_target_nodes();
+	return 0;
+}
+
+/*
+ * This leaves migrate-on-reclaim transiently disabled between
+ * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
+ * whether reclaim-based migration is enabled or not, which
+ * ensures that the user can turn reclaim-based migration on at
+ * any time without needing to recalculate migration targets.
+ *
+ * These callbacks already hold get_online_mems().  That is why
+ * __set_migration_target_nodes() can be used as opposed to
+ * set_migration_target_nodes().
+ */
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+						 unsigned long action, void *arg)
+{
+	switch (action) {
+	case MEM_GOING_OFFLINE:
+		/*
+		 * Make sure there are not transient states where
+		 * an offline node is a migration target.  This
+		 * will leave migration disabled until the offline
+		 * completes and the MEM_OFFLINE case below runs.
+		 */
+		disable_all_migrate_targets();
+		break;
+	case MEM_OFFLINE:
+	case MEM_ONLINE:
+		/*
+		 * Recalculate the target nodes once the node
+		 * reaches its final state (online or offline).
+		 */
+		__set_migration_target_nodes();
+		break;
+	case MEM_CANCEL_OFFLINE:
+		/*
+		 * MEM_GOING_OFFLINE disabled all the migration
+		 * targets.  Reenable them.
+		 */
+		__set_migration_target_nodes();
+		break;
+	case MEM_GOING_ONLINE:
+	case MEM_CANCEL_ONLINE:
+		break;
+	}
+
+	return notifier_from_errno(0);
+}
+
+static int __init migrate_on_reclaim_init(void)
+{
+	int ret;
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "migrate on reclaim",
+				migration_online_cpu,
+				migration_offline_cpu);
+	/*
+	 * In the unlikely case that this fails, the automatic
+	 * migration targets may become suboptimal for nodes
+	 * where N_CPU changes.  With such a small impact in a
+	 * rare case, do not bother trying to do anything special.
+	 */
+	WARN_ON(ret < 0);
+
+	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
+	return 0;
+}
+late_initcall(migrate_on_reclaim_init);
+#endif /* CONFIG_MEMORY_HOTPLUG */
_



* [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-01 18:32 [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Dave Hansen
                   ` (2 preceding siblings ...)
  2021-04-01 18:32 ` [PATCH 03/10] mm/migrate: update node demotion order during hotplug events Dave Hansen
@ 2021-04-01 18:32 ` Dave Hansen
  2021-04-01 22:35   ` Wei Xu
                     ` (2 more replies)
  2021-04-01 18:32 ` [PATCH 05/10] mm/migrate: demote pages during reclaim Dave Hansen
                   ` (6 subsequent siblings)
  10 siblings, 3 replies; 64+ messages in thread
From: Dave Hansen @ 2021-04-01 18:32 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Dave Hansen, yang.shi, shy828301, weixugc,
	ying.huang, dan.j.williams, david, osalvador


From: Yang Shi <yang.shi@linux.alibaba.com>

migrate_pages() returns the number of pages that were not migrated,
or an error code.  When returning an error code, there is no way to
know how many pages were migrated or not migrated.

In a following patch, migrate_pages() is used to demote pages to a
PMEM node, and we need to account for how many pages are reclaimed
(demoted) since page reclaim behavior depends on this.  Add an
*nr_succeeded parameter so that migrate_pages() reports how many
pages were demoted successfully in all cases.
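
A minimal sketch of the updated calling convention (mirroring the
conversions below; 'pagelist' and 'mtc' are assumed to be set up as
in the existing callers):

	unsigned int nr_succeeded = 0;
	int err;

	/* 'pagelist' and 'mtc' set up as in existing callers */
	err = migrate_pages(&pagelist, alloc_migration_target, NULL,
			(unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL,
			&nr_succeeded);
	/* nr_succeeded is valid even when err indicates failures */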

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>

--

Note: Yang Shi originally wrote the patch, thus the SoB.  There was
also a Reviewed-by provided since there were some modifications made
to this after the original work.

Changes since 20210302:
 * Fix definition of CONFIG_MIGRATION=n stub migrate_pages().  Its
   parameters were wrong, but oddly enough did not generate any
   compile errors.

Changes since 20200122:
 * Fix migrate_pages() to manipulate nr_succeeded *value*
   rather than the pointer.
---

 b/include/linux/migrate.h |    5 +++--
 b/mm/compaction.c         |    3 ++-
 b/mm/gup.c                |    3 ++-
 b/mm/memory-failure.c     |    4 +++-
 b/mm/memory_hotplug.c     |    4 +++-
 b/mm/mempolicy.c          |    8 ++++++--
 b/mm/migrate.c            |   19 +++++++++++--------
 b/mm/page_alloc.c         |    9 ++++++---
 8 files changed, 36 insertions(+), 19 deletions(-)

diff -puN include/linux/migrate.h~migrate_pages-add-success-return include/linux/migrate.h
--- a/include/linux/migrate.h~migrate_pages-add-success-return	2021-03-31 15:17:14.144000255 -0700
+++ b/include/linux/migrate.h	2021-03-31 15:17:14.182000255 -0700
@@ -40,7 +40,8 @@ extern int migrate_page(struct address_s
 			struct page *newpage, struct page *page,
 			enum migrate_mode mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
-		unsigned long private, enum migrate_mode mode, int reason);
+		unsigned long private, enum migrate_mode mode, int reason,
+		unsigned int *nr_succeeded);
 extern struct page *alloc_migration_target(struct page *page, unsigned long private);
 extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 extern void putback_movable_page(struct page *page);
@@ -58,7 +59,7 @@ extern int migrate_page_move_mapping(str
 static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t new,
 		free_page_t free, unsigned long private, enum migrate_mode mode,
-		int reason)
+		int reason, unsigned int *nr_succeeded)
 	{ return -ENOSYS; }
 static inline struct page *alloc_migration_target(struct page *page,
 		unsigned long private)
diff -puN mm/compaction.c~migrate_pages-add-success-return mm/compaction.c
--- a/mm/compaction.c~migrate_pages-add-success-return	2021-03-31 15:17:14.146000255 -0700
+++ b/mm/compaction.c	2021-03-31 15:17:14.186000255 -0700
@@ -2247,6 +2247,7 @@ compact_zone(struct compact_control *cc,
 	unsigned long last_migrated_pfn;
 	const bool sync = cc->mode != MIGRATE_ASYNC;
 	bool update_cached;
+	unsigned int nr_succeeded = 0;
 
 	/*
 	 * These counters track activities during zone compaction.  Initialize
@@ -2364,7 +2365,7 @@ compact_zone(struct compact_control *cc,
 
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				compaction_free, (unsigned long)cc, cc->mode,
-				MR_COMPACTION);
+				MR_COMPACTION, &nr_succeeded);
 
 		trace_mm_compaction_migratepages(cc->nr_migratepages, err,
 							&cc->migratepages);
diff -puN mm/gup.c~migrate_pages-add-success-return mm/gup.c
--- a/mm/gup.c~migrate_pages-add-success-return	2021-03-31 15:17:14.150000255 -0700
+++ b/mm/gup.c	2021-03-31 15:17:14.190000255 -0700
@@ -1550,6 +1550,7 @@ static long check_and_migrate_cma_pages(
 	unsigned long i;
 	unsigned long step;
 	bool drain_allow = true;
+	unsigned int nr_succeeded = 0;
 	bool migrate_allow = true;
 	LIST_HEAD(cma_page_list);
 	long ret = nr_pages;
@@ -1606,7 +1607,7 @@ check_again:
 				put_page(pages[i]);
 
 		if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
-			(unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
+			(unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE, &nr_succeeded)) {
 			/*
 			 * some of the pages failed migration. Do get_user_pages
 			 * without migration.
diff -puN mm/memory-failure.c~migrate_pages-add-success-return mm/memory-failure.c
--- a/mm/memory-failure.c~migrate_pages-add-success-return	2021-03-31 15:17:14.155000255 -0700
+++ b/mm/memory-failure.c	2021-03-31 15:17:14.194000255 -0700
@@ -1809,6 +1809,7 @@ static int __soft_offline_page(struct pa
 	unsigned long pfn = page_to_pfn(page);
 	struct page *hpage = compound_head(page);
 	char const *msg_page[] = {"page", "hugepage"};
+	unsigned int nr_succeeded = 0;
 	bool huge = PageHuge(page);
 	LIST_HEAD(pagelist);
 	struct migration_target_control mtc = {
@@ -1852,7 +1853,8 @@ static int __soft_offline_page(struct pa
 
 	if (isolate_page(hpage, &pagelist)) {
 		ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
-			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE);
+			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE,
+			&nr_succeeded);
 		if (!ret) {
 			bool release = !huge;
 
diff -puN mm/memory_hotplug.c~migrate_pages-add-success-return mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~migrate_pages-add-success-return	2021-03-31 15:17:14.160000255 -0700
+++ b/mm/memory_hotplug.c	2021-03-31 15:17:14.197000255 -0700
@@ -1392,6 +1392,7 @@ do_migrate_range(unsigned long start_pfn
 	unsigned long pfn;
 	struct page *page, *head;
 	int ret = 0;
+	unsigned int nr_succeeded = 0;
 	LIST_HEAD(source);
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
@@ -1466,7 +1467,8 @@ do_migrate_range(unsigned long start_pfn
 		if (nodes_empty(nmask))
 			node_set(mtc.nid, nmask);
 		ret = migrate_pages(&source, alloc_migration_target, NULL,
-			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
+			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG,
+			&nr_succeeded);
 		if (ret) {
 			list_for_each_entry(page, &source, lru) {
 				pr_warn("migrating pfn %lx failed ret:%d ",
diff -puN mm/mempolicy.c~migrate_pages-add-success-return mm/mempolicy.c
--- a/mm/mempolicy.c~migrate_pages-add-success-return	2021-03-31 15:17:14.163000255 -0700
+++ b/mm/mempolicy.c	2021-03-31 15:17:14.203000255 -0700
@@ -1081,6 +1081,7 @@ static int migrate_page_add(struct page
 static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 			   int flags)
 {
+	unsigned int nr_succeeded = 0;
 	nodemask_t nmask;
 	LIST_HEAD(pagelist);
 	int err = 0;
@@ -1103,7 +1104,8 @@ static int migrate_to_node(struct mm_str
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, alloc_migration_target, NULL,
-				(unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
+				(unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL,
+				&nr_succeeded);
 		if (err)
 			putback_movable_pages(&pagelist);
 	}
@@ -1278,6 +1280,7 @@ static long do_mbind(unsigned long start
 		     nodemask_t *nmask, unsigned long flags)
 {
 	struct mm_struct *mm = current->mm;
+	unsigned int nr_succeeded = 0;
 	struct mempolicy *new;
 	unsigned long end;
 	int err;
@@ -1355,7 +1358,8 @@ static long do_mbind(unsigned long start
 		if (!list_empty(&pagelist)) {
 			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
 			nr_failed = migrate_pages(&pagelist, new_page, NULL,
-				start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
+				start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND,
+				&nr_succeeded);
 			if (nr_failed)
 				putback_movable_pages(&pagelist);
 		}
diff -puN mm/migrate.c~migrate_pages-add-success-return mm/migrate.c
--- a/mm/migrate.c~migrate_pages-add-success-return	2021-03-31 15:17:14.168000255 -0700
+++ b/mm/migrate.c	2021-03-31 15:17:14.207000255 -0700
@@ -1493,6 +1493,7 @@ static inline int try_split_thp(struct p
  * @mode:		The migration mode that specifies the constraints for
  *			page migration, if any.
  * @reason:		The reason for page migration.
+ * @nr_succeeded:	The number of pages migrated successfully.
  *
  * The function returns after 10 attempts or if no pages are movable any more
  * because the list has become empty or no retryable pages exist any more.
@@ -1503,12 +1504,11 @@ static inline int try_split_thp(struct p
  */
 int migrate_pages(struct list_head *from, new_page_t get_new_page,
 		free_page_t put_new_page, unsigned long private,
-		enum migrate_mode mode, int reason)
+		enum migrate_mode mode, int reason, unsigned int *nr_succeeded)
 {
 	int retry = 1;
 	int thp_retry = 1;
 	int nr_failed = 0;
-	int nr_succeeded = 0;
 	int nr_thp_succeeded = 0;
 	int nr_thp_failed = 0;
 	int nr_thp_split = 0;
@@ -1611,10 +1611,10 @@ retry:
 			case MIGRATEPAGE_SUCCESS:
 				if (is_thp) {
 					nr_thp_succeeded++;
-					nr_succeeded += nr_subpages;
+					*nr_succeeded += nr_subpages;
 					break;
 				}
-				nr_succeeded++;
+				(*nr_succeeded)++;
 				break;
 			default:
 				/*
@@ -1643,12 +1643,12 @@ out:
 	 */
 	list_splice(&ret_pages, from);
 
-	count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
+	count_vm_events(PGMIGRATE_SUCCESS, *nr_succeeded);
 	count_vm_events(PGMIGRATE_FAIL, nr_failed);
 	count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded);
 	count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed);
 	count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split);
-	trace_mm_migrate_pages(nr_succeeded, nr_failed, nr_thp_succeeded,
+	trace_mm_migrate_pages(*nr_succeeded, nr_failed, nr_thp_succeeded,
 			       nr_thp_failed, nr_thp_split, mode, reason);
 
 	if (!swapwrite)
@@ -1716,6 +1716,7 @@ static int store_status(int __user *stat
 static int do_move_pages_to_node(struct mm_struct *mm,
 		struct list_head *pagelist, int node)
 {
+	unsigned int nr_succeeded = 0;
 	int err;
 	struct migration_target_control mtc = {
 		.nid = node,
@@ -1723,7 +1724,8 @@ static int do_move_pages_to_node(struct
 	};
 
 	err = migrate_pages(pagelist, alloc_migration_target, NULL,
-			(unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
+			(unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL,
+			&nr_succeeded);
 	if (err)
 		putback_movable_pages(pagelist);
 	return err;
@@ -2207,6 +2209,7 @@ int migrate_misplaced_page(struct page *
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
 	int nr_remaining;
+	unsigned int nr_succeeded = 0;
 	LIST_HEAD(migratepages);
 
 	/*
@@ -2230,7 +2233,7 @@ int migrate_misplaced_page(struct page *
 	list_add(&page->lru, &migratepages);
 	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
 				     NULL, node, MIGRATE_ASYNC,
-				     MR_NUMA_MISPLACED);
+				     MR_NUMA_MISPLACED, &nr_succeeded);
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
diff -puN mm/page_alloc.c~migrate_pages-add-success-return mm/page_alloc.c
--- a/mm/page_alloc.c~migrate_pages-add-success-return	2021-03-31 15:17:14.178000255 -0700
+++ b/mm/page_alloc.c	2021-03-31 15:17:14.213000255 -0700
@@ -8452,7 +8452,8 @@ static unsigned long pfn_max_align_up(un
 
 /* [start, end) must belong to a single zone. */
 static int __alloc_contig_migrate_range(struct compact_control *cc,
-					unsigned long start, unsigned long end)
+					unsigned long start, unsigned long end,
+					unsigned int *nr_succeeded)
 {
 	/* This function is based on compact_zone() from compaction.c. */
 	unsigned int nr_reclaimed;
@@ -8490,7 +8491,8 @@ static int __alloc_contig_migrate_range(
 		cc->nr_migratepages -= nr_reclaimed;
 
 		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
-				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
+				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE,
+				nr_succeeded);
 	}
 	if (ret < 0) {
 		putback_movable_pages(&cc->migratepages);
@@ -8526,6 +8528,7 @@ int alloc_contig_range(unsigned long sta
 	unsigned long outer_start, outer_end;
 	unsigned int order;
 	int ret = 0;
+	unsigned int nr_succeeded = 0;
 
 	struct compact_control cc = {
 		.nr_migratepages = 0,
@@ -8580,7 +8583,7 @@ int alloc_contig_range(unsigned long sta
 	 * allocated.  So, if we fall through be sure to clear ret so that
 	 * -EBUSY is not accidentally used or returned to caller.
 	 */
-	ret = __alloc_contig_migrate_range(&cc, start, end);
+	ret = __alloc_contig_migrate_range(&cc, start, end, &nr_succeeded);
 	if (ret && ret != -EBUSY)
 		goto done;
 	ret =0;
_



* [PATCH 05/10] mm/migrate: demote pages during reclaim
  2021-04-01 18:32 [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Dave Hansen
                   ` (3 preceding siblings ...)
  2021-04-01 18:32 ` [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded Dave Hansen
@ 2021-04-01 18:32 ` Dave Hansen
  2021-04-01 20:01   ` Yang Shi
                     ` (2 more replies)
  2021-04-01 18:32 ` [PATCH 06/10] mm/vmscan: add page demotion counter Dave Hansen
                   ` (5 subsequent siblings)
  10 siblings, 3 replies; 64+ messages in thread
From: Dave Hansen @ 2021-04-01 18:32 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Dave Hansen, weixugc, yang.shi, rientjes,
	ying.huang, dan.j.williams, osalvador


From: Dave Hansen <dave.hansen@linux.intel.com>

This is mostly derived from a patch from Yang Shi:

	https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang.shi@linux.alibaba.com/

Add code to the reclaim path (shrink_page_list()) to "demote" data
to another NUMA node instead of discarding the data.  This always
avoids the cost of I/O needed to read the page back in and sometimes
avoids the writeout cost when the page is dirty.

A second pass through shrink_page_list() will be made if any demotions
fail.  This essentially falls back to normal reclaim behavior in the
case that demotions fail.  Previous versions of this patch may have
simply failed to reclaim pages which were eligible for demotion but
were unable to be demoted in practice.

Note: This just adds the start of the infrastructure for migration.  It is
actually disabled next to the FIXME in migrate_demote_page_ok().

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: osalvador <osalvador@suse.de>

--
changes from 20210122:
 * move from GFP_HIGHUSER -> GFP_HIGHUSER_MOVABLE (Ying)

changes from 202010:
 * add MR_NUMA_MISPLACED to trace MIGRATE_REASON define
 * make migrate_demote_page_ok() static, remove 'sc' arg until
   later patch
 * remove unnecessary alloc_demote_page() hugetlb warning
 * Simplify alloc_demote_page() gfp mask.  Depend on
   __GFP_NORETRY to make it lightweight instead of fancier
   stuff like leaving out __GFP_IO/FS.
 * Allocate migration page with alloc_migration_target()
   instead of allocating directly.
changes from 20200730:
 * Add another pass through shrink_page_list() when demotion
   fails.
changes from 20210302:
 * Use __GFP_THISNODE and revise the comment explaining the
   GFP mask construction
---

 b/include/linux/migrate.h        |    9 ++++
 b/include/trace/events/migrate.h |    3 -
 b/mm/vmscan.c                    |   82 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 93 insertions(+), 1 deletion(-)

diff -puN include/linux/migrate.h~demote-with-migrate_pages include/linux/migrate.h
--- a/include/linux/migrate.h~demote-with-migrate_pages	2021-03-31 15:17:15.842000251 -0700
+++ b/include/linux/migrate.h	2021-03-31 15:17:15.853000251 -0700
@@ -27,6 +27,7 @@ enum migrate_reason {
 	MR_MEMPOLICY_MBIND,
 	MR_NUMA_MISPLACED,
 	MR_CONTIG_RANGE,
+	MR_DEMOTION,
 	MR_TYPES
 };
 
@@ -196,6 +197,14 @@ struct migrate_vma {
 int migrate_vma_setup(struct migrate_vma *args);
 void migrate_vma_pages(struct migrate_vma *migrate);
 void migrate_vma_finalize(struct migrate_vma *migrate);
+int next_demotion_node(int node);
+
+#else /* CONFIG_MIGRATION disabled: */
+
+static inline int next_demotion_node(int node)
+{
+	return NUMA_NO_NODE;
+}
 
 #endif /* CONFIG_MIGRATION */
 
diff -puN include/trace/events/migrate.h~demote-with-migrate_pages include/trace/events/migrate.h
--- a/include/trace/events/migrate.h~demote-with-migrate_pages	2021-03-31 15:17:15.846000251 -0700
+++ b/include/trace/events/migrate.h	2021-03-31 15:17:15.853000251 -0700
@@ -20,7 +20,8 @@
 	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
 	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
 	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
-	EMe(MR_CONTIG_RANGE,	"contig_range")
+	EM( MR_CONTIG_RANGE,	"contig_range")			\
+	EMe(MR_DEMOTION,	"demotion")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff -puN mm/vmscan.c~demote-with-migrate_pages mm/vmscan.c
--- a/mm/vmscan.c~demote-with-migrate_pages	2021-03-31 15:17:15.848000251 -0700
+++ b/mm/vmscan.c	2021-03-31 15:17:15.856000251 -0700
@@ -41,6 +41,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/memcontrol.h>
+#include <linux/migrate.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
 #include <linux/oom.h>
@@ -1035,6 +1036,23 @@ static enum page_references page_check_r
 	return PAGEREF_RECLAIM;
 }
 
+static bool migrate_demote_page_ok(struct page *page)
+{
+	int next_nid = next_demotion_node(page_to_nid(page));
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(PageHuge(page), page);
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+
+	if (next_nid == NUMA_NO_NODE)
+		return false;
+	if (PageTransHuge(page) && !thp_migration_supported())
+		return false;
+
+	// FIXME: actually enable this later in the series
+	return false;
+}
+
 /* Check if a page is dirty or under writeback */
 static void page_check_dirty_writeback(struct page *page,
 				       bool *dirty, bool *writeback)
@@ -1065,6 +1083,46 @@ static void page_check_dirty_writeback(s
 		mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
 }
 
+static struct page *alloc_demote_page(struct page *page, unsigned long node)
+{
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_NORETRY |
+			    __GFP_THISNODE	 | __GFP_NOWARN,
+		.nid = node
+	};
+
+	return alloc_migration_target(page, (unsigned long)&mtc);
+}
+
+/*
+ * Take pages on @demote_pages and attempt to demote them to
+ * another node.  Pages which are not demoted are left on
+ * @demote_pages.
+ */
+static unsigned int demote_page_list(struct list_head *demote_pages,
+				     struct pglist_data *pgdat,
+				     struct scan_control *sc)
+{
+	int target_nid = next_demotion_node(pgdat->node_id);
+	unsigned int nr_succeeded = 0;
+	int err;
+
+	if (list_empty(demote_pages))
+		return 0;
+
+	/* Demotion ignores all cpuset and mempolicy settings */
+	err = migrate_pages(demote_pages, alloc_demote_page, NULL,
+			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
+			    &nr_succeeded);
+
+	return nr_succeeded;
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -1076,12 +1134,15 @@ static unsigned int shrink_page_list(str
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(demote_pages);
 	unsigned int nr_reclaimed = 0;
 	unsigned int pgactivate = 0;
+	bool do_demote_pass = true;
 
 	memset(stat, 0, sizeof(*stat));
 	cond_resched();
 
+retry:
 	while (!list_empty(page_list)) {
 		struct address_space *mapping;
 		struct page *page;
@@ -1231,6 +1292,16 @@ static unsigned int shrink_page_list(str
 		}
 
 		/*
+		 * Before reclaiming the page, try to relocate
+		 * its contents to another node.
+		 */
+		if (do_demote_pass && migrate_demote_page_ok(page)) {
+			list_add(&page->lru, &demote_pages);
+			unlock_page(page);
+			continue;
+		}
+
+		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
 		 * Lazyfree page could be freed directly
@@ -1480,6 +1551,17 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
+	/* 'page_list' is always empty here */
+
+	/* Migrate pages selected for demotion */
+	nr_reclaimed += demote_page_list(&demote_pages, pgdat, sc);
+	/* Pages that could not be demoted are still in @demote_pages */
+	if (!list_empty(&demote_pages)) {
+		/* Pages which failed to be demoted go back on @page_list for retry: */
+		list_splice_init(&demote_pages, page_list);
+		do_demote_pass = false;
+		goto retry;
+	}
 
 	pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
 
_



* [PATCH 06/10] mm/vmscan: add page demotion counter
  2021-04-01 18:32 [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Dave Hansen
                   ` (4 preceding siblings ...)
  2021-04-01 18:32 ` [PATCH 05/10] mm/migrate: demote pages during reclaim Dave Hansen
@ 2021-04-01 18:32 ` Dave Hansen
  2021-04-10  3:40   ` Wei Xu
  2021-04-01 18:32 ` [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages Dave Hansen
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 64+ messages in thread
From: Dave Hansen @ 2021-04-01 18:32 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Dave Hansen, yang.shi, shy828301, weixugc,
	rientjes, ying.huang, dan.j.williams, david, osalvador


From: Yang Shi <yang.shi@linux.alibaba.com>

Account the number of demoted pages into reclaim_state->nr_demoted.

Add pgdemote_kswapd and pgdemote_direct VM counters showed in
/proc/vmstat.

[ daveh:
   - __count_vm_events() a bit, and made them look at the THP
     size directly rather than getting data from migrate_pages()
]

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>

--

Changes since 202010:
 * remove unused scan-control 'demoted' field
---

 b/include/linux/vm_event_item.h |    2 ++
 b/mm/vmscan.c                   |    5 +++++
 b/mm/vmstat.c                   |    2 ++
 3 files changed, 9 insertions(+)

diff -puN include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter include/linux/vm_event_item.h
--- a/include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter	2021-03-31 15:17:17.079000248 -0700
+++ b/include/linux/vm_event_item.h	2021-03-31 15:17:17.101000248 -0700
@@ -33,6 +33,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		PGREUSE,
 		PGSTEAL_KSWAPD,
 		PGSTEAL_DIRECT,
+		PGDEMOTE_KSWAPD,
+		PGDEMOTE_DIRECT,
 		PGSCAN_KSWAPD,
 		PGSCAN_DIRECT,
 		PGSCAN_DIRECT_THROTTLE,
diff -puN mm/vmscan.c~mm-vmscan-add-page-demotion-counter mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-add-page-demotion-counter	2021-03-31 15:17:17.081000248 -0700
+++ b/mm/vmscan.c	2021-03-31 15:17:17.109000248 -0700
@@ -1120,6 +1120,11 @@ static unsigned int demote_page_list(str
 			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
 			    &nr_succeeded);
 
+	if (current_is_kswapd())
+		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
+	else
+		__count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
+
 	return nr_succeeded;
 }
 
diff -puN mm/vmstat.c~mm-vmscan-add-page-demotion-counter mm/vmstat.c
--- a/mm/vmstat.c~mm-vmscan-add-page-demotion-counter	2021-03-31 15:17:17.092000248 -0700
+++ b/mm/vmstat.c	2021-03-31 15:17:17.116000248 -0700
@@ -1259,6 +1259,8 @@ const char * const vmstat_text[] = {
 	"pgreuse",
 	"pgsteal_kswapd",
 	"pgsteal_direct",
+	"pgdemote_kswapd",
+	"pgdemote_direct",
 	"pgscan_kswapd",
 	"pgscan_direct",
 	"pgscan_direct_throttle",
_



* [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages
  2021-04-01 18:32 [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Dave Hansen
                   ` (5 preceding siblings ...)
  2021-04-01 18:32 ` [PATCH 06/10] mm/vmscan: add page demotion counter Dave Hansen
@ 2021-04-01 18:32 ` Dave Hansen
  2021-04-07 18:40   ` Wei Xu
  2021-04-01 18:32 ` [PATCH 08/10] mm/vmscan: Consider anonymous pages without swap Dave Hansen
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 64+ messages in thread
From: Dave Hansen @ 2021-04-01 18:32 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Dave Hansen, shy828301, gthelen, weixugc, rientjes,
	ying.huang, dan.j.williams, david, osalvador


From: Dave Hansen <dave.hansen@linux.intel.com>

Anonymous pages are kept on their own LRU(s).  These lists could
theoretically always be scanned and maintained.  But, without swap,
there is currently nothing the kernel can *do* with the results of a
scanned, sorted LRU for anonymous pages.

A check for '!total_swap_pages' currently serves to decide whether
anonymous LRUs should be maintained.  However, another
method will be added shortly: page demotion.

Abstract out the 'total_swap_pages' checks into a helper, give it a
logically significant name, and check for the possibility of page
demotion.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>
---

 b/mm/vmscan.c |   28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff -puN mm/vmscan.c~mm-vmscan-anon-can-be-aged mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-anon-can-be-aged	2021-03-31 15:17:18.325000245 -0700
+++ b/mm/vmscan.c	2021-03-31 15:17:18.333000245 -0700
@@ -2508,6 +2508,26 @@ out:
 	}
 }
 
+/*
+ * Anonymous LRU management is a waste if there is
+ * ultimately no way to reclaim the memory.
+ */
+bool anon_should_be_aged(struct lruvec *lruvec)
+{
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	/* Aging the anon LRU is valuable if swap is present: */
+	if (total_swap_pages > 0)
+		return true;
+
+	/* Also valuable if anon pages can be demoted: */
+	if (next_demotion_node(pgdat->node_id) >= 0)
+		return true;
+
+	/* No way to reclaim anon pages.  Should not age anon LRUs: */
+	return false;
+}
+
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
@@ -2617,7 +2637,8 @@ static void shrink_lruvec(struct lruvec
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
+	if (anon_should_be_aged(lruvec) &&
+	    inactive_is_low(lruvec, LRU_INACTIVE_ANON))
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
 }
@@ -3446,10 +3467,11 @@ static void age_active_anon(struct pglis
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
-	if (!total_swap_pages)
+	lruvec = mem_cgroup_lruvec(NULL, pgdat);
+
+	if (!anon_should_be_aged(lruvec))
 		return;
 
-	lruvec = mem_cgroup_lruvec(NULL, pgdat);
 	if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON))
 		return;
 
_


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 08/10] mm/vmscan: Consider anonymous pages without swap
  2021-04-01 18:32 [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Dave Hansen
                   ` (6 preceding siblings ...)
  2021-04-01 18:32 ` [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages Dave Hansen
@ 2021-04-01 18:32 ` Dave Hansen
  2021-04-02  0:55   ` Wei Xu
  2021-04-01 18:32 ` [PATCH 09/10] mm/vmscan: never demote for memcg reclaim Dave Hansen
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 64+ messages in thread
From: Dave Hansen @ 2021-04-01 18:32 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Dave Hansen, kbusch, shy828301, weixugc, rientjes,
	ying.huang, dan.j.williams, david, osalvador


From: Keith Busch <kbusch@kernel.org>

Reclaim anonymous pages if a migration path is available now that
demotion provides a non-swap recourse for reclaiming anon pages.

Note that this check is subtly different from the anon_should_be_aged()
check.  This mechanism checks whether a specific page in a specific
context *can* actually be reclaimed, given current swap space and
cgroup limits.

anon_should_be_aged() is a much simpler and more preliminary check
which just says whether there is a possibility of future reclaim.
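
For illustration, a minimal sketch contrasting the two checks.  The
wrapper function is hypothetical; the two helpers are the ones from
the previous patch and this one:

static bool example_worth_scanning_anon(struct lruvec *lruvec,
					struct mem_cgroup *memcg)
{
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);

	/*
	 * Coarse, forward-looking question: could anon reclaim *ever*
	 * work on this node?  Consults only total_swap_pages and
	 * next_demotion_node().
	 */
	if (!anon_should_be_aged(lruvec))
		return false;

	/*
	 * Per-context question: can *this* reclaim attempt make
	 * progress right now?  Consults free swap space (or the
	 * memcg's remaining swap quota) plus the demotion target.
	 */
	return can_reclaim_anon_pages(memcg, pgdat->node_id);
}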

#Signed-off-by: Keith Busch <keith.busch@intel.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>

--

Changes from Dave 10/2020:
 * remove 'total_swap_pages' modification

Changes from Dave 06/2020:
 * rename reclaim_anon_pages()->can_reclaim_anon_pages()

Note: Keith's Intel SoB is commented out because he is no
longer at Intel and his @intel.com mail will bounce.
---

 b/mm/vmscan.c |   35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff -puN mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap mm/vmscan.c
--- a/mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap	2021-03-31 15:17:19.388000242 -0700
+++ b/mm/vmscan.c	2021-03-31 15:17:19.407000242 -0700
@@ -287,6 +287,34 @@ static bool writeback_throttling_sane(st
 }
 #endif
 
+static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
+					  int node_id)
+{
+	if (memcg == NULL) {
+		/*
+		 * For non-memcg reclaim, is there
+		 * space in any swap device?
+		 */
+		if (get_nr_swap_pages() > 0)
+			return true;
+	} else {
+		/* Is the memcg below its swap limit? */
+		if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
+			return true;
+	}
+
+	/*
+	 * The page can not be swapped.
+	 *
+	 * Can it be reclaimed from this node via demotion?
+	 */
+	if (next_demotion_node(node_id) >= 0)
+		return true;
+
+	/* No way to reclaim anon pages */
+	return false;
+}
+
 /*
  * This misses isolated pages which are not accounted for to save counters.
  * As the data only determines if reclaim or compaction continues, it is
@@ -298,7 +326,7 @@ unsigned long zone_reclaimable_pages(str
 
 	nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
 		zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
+	if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
 		nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
 			zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
 
@@ -2323,6 +2351,7 @@ enum scan_balance {
 static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			   unsigned long *nr)
 {
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	unsigned long anon_cost, file_cost, total_cost;
 	int swappiness = mem_cgroup_swappiness(memcg);
@@ -2333,7 +2362,7 @@ static void get_scan_count(struct lruvec
 	enum lru_list lru;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
-	if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
+	if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2708,7 +2737,7 @@ static inline bool should_continue_recla
 	 */
 	pages_for_compaction = compact_gap(sc->order);
 	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
+	if (can_reclaim_anon_pages(NULL, pgdat->node_id))
 		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 
 	return inactive_lru_pages > pages_for_compaction;
_


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 09/10] mm/vmscan: never demote for memcg reclaim
  2021-04-01 18:32 [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Dave Hansen
                   ` (7 preceding siblings ...)
  2021-04-01 18:32 ` [PATCH 08/10] mm/vmscan: Consider anonymous pages without swap Dave Hansen
@ 2021-04-01 18:32 ` Dave Hansen
  2021-04-02  0:18   ` Wei Xu
  2021-04-01 18:32 ` [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration Dave Hansen
  2021-04-16 12:35 ` [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Michal Hocko
  10 siblings, 1 reply; 64+ messages in thread
From: Dave Hansen @ 2021-04-01 18:32 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Dave Hansen, yang.shi, shy828301, weixugc,
	rientjes, ying.huang, dan.j.williams, david, osalvador


From: Dave Hansen <dave.hansen@linux.intel.com>

Global reclaim aims to reduce the amount of memory used on
a given node or set of nodes.  Migrating pages to another
node serves this purpose.

memcg reclaim is different.  Its goal is to reduce the
total memory consumption of the entire memcg, across all
nodes.  Migration does not assist memcg reclaim because
it just moves page contents between nodes rather than
actually reducing memory consumption.
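
For readers less familiar with mm/vmscan.c: cgroup_reclaim(sc) is, at
this point in the code, essentially a test of whether
sc->target_mem_cgroup is set.  A purely illustrative restatement of
the gate this patch adds:

static bool example_demotion_makes_sense(struct scan_control *sc)
{
	/*
	 * Global (kswapd or direct) reclaim frees memory on a
	 * particular node, so moving a page to another node helps.
	 * memcg reclaim needs to shrink the cgroup's total footprint,
	 * which demotion does not change.
	 */
	return !cgroup_reclaim(sc);
}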

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>
---

 b/mm/vmscan.c |   18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff -puN mm/vmscan.c~never-demote-for-memcg-reclaim mm/vmscan.c
--- a/mm/vmscan.c~never-demote-for-memcg-reclaim	2021-03-31 15:17:20.476000239 -0700
+++ b/mm/vmscan.c	2021-03-31 15:17:20.487000239 -0700
@@ -288,7 +288,8 @@ static bool writeback_throttling_sane(st
 #endif
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
-					  int node_id)
+					  int node_id,
+					  struct scan_control *sc)
 {
 	if (memcg == NULL) {
 		/*
@@ -326,7 +327,7 @@ unsigned long zone_reclaimable_pages(str
 
 	nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
 		zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
-	if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
+	if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
 		nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
 			zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
 
@@ -1064,7 +1065,8 @@ static enum page_references page_check_r
 	return PAGEREF_RECLAIM;
 }
 
-static bool migrate_demote_page_ok(struct page *page)
+static bool migrate_demote_page_ok(struct page *page,
+				   struct scan_control *sc)
 {
 	int next_nid = next_demotion_node(page_to_nid(page));
 
@@ -1072,6 +1074,10 @@ static bool migrate_demote_page_ok(struc
 	VM_BUG_ON_PAGE(PageHuge(page), page);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
+	/* It is pointless to do demotion in memcg reclaim */
+	if (cgroup_reclaim(sc))
+		return false;
+
 	if (next_nid == NUMA_NO_NODE)
 		return false;
 	if (PageTransHuge(page) && !thp_migration_supported())
@@ -1328,7 +1334,7 @@ retry:
 		 * Before reclaiming the page, try to relocate
 		 * its contents to another node.
 		 */
-		if (do_demote_pass && migrate_demote_page_ok(page)) {
+		if (do_demote_pass && migrate_demote_page_ok(page, sc)) {
 			list_add(&page->lru, &demote_pages);
 			unlock_page(page);
 			continue;
@@ -2362,7 +2368,7 @@ static void get_scan_count(struct lruvec
 	enum lru_list lru;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
-	if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
+	if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2737,7 +2743,7 @@ static inline bool should_continue_recla
 	 */
 	pages_for_compaction = compact_gap(sc->order);
 	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-	if (can_reclaim_anon_pages(NULL, pgdat->node_id))
+	if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
 		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 
 	return inactive_lru_pages > pages_for_compaction;
_


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration
  2021-04-01 18:32 [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Dave Hansen
                   ` (8 preceding siblings ...)
  2021-04-01 18:32 ` [PATCH 09/10] mm/vmscan: never demote for memcg reclaim Dave Hansen
@ 2021-04-01 18:32 ` Dave Hansen
  2021-04-01 20:06   ` Yang Shi
  2021-04-10  4:10   ` Wei Xu
  2021-04-16 12:35 ` [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Michal Hocko
  10 siblings, 2 replies; 64+ messages in thread
From: Dave Hansen @ 2021-04-01 18:32 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Dave Hansen, weixugc, yang.shi, rientjes,
	ying.huang, dan.j.williams, david, osalvador


From: Dave Hansen <dave.hansen@linux.intel.com>

Some method is obviously needed to enable reclaim-based migration.

Just like traditional autonuma, some workloads will benefit from this,
such as workloads with more "static" configurations where hot pages
stay hot and cold pages stay cold.  If pages come and go from the hot
and cold sets, the benefits of this approach will be more limited.

The benefits are truly workload-based and *not* hardware-based.
We do not believe that there is a viable threshold where certain
hardware configurations should have this mechanism enabled while
others do not.

To be conservative, earlier work disabled reclaim-based migration by
default and did not include a mechanism to enable it.  This patch
proposes extending the existing "zone_reclaim_mode" (now really
node_reclaim_mode) sysctl as the method to enable it.

We are open to any alternative that allows end users to enable this
mechanism or disable it if workload harm is detected (just like
traditional autonuma).

Once this is enabled, page demotion may move data to a NUMA node
that does not fall into the cpuset of the allocating process.
This could be construed to violate the guarantees of cpusets.
However, since this is an opt-in mechanism, the assumption is
that anyone enabling it is content to relax the guarantees.
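
For illustration, a minimal userspace sketch of opting in, assuming
the interface lands as documented below (bit 8, exposed as
/proc/sys/vm/zone_reclaim_mode); it is equivalent to
"echo 8 > /proc/sys/vm/zone_reclaim_mode":

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "w");

	if (!f) {
		perror("zone_reclaim_mode");
		return 1;
	}
	/* 8 == RECLAIM_MIGRATE; OR with 1/2/4 to combine with older modes */
	fprintf(f, "8\n");
	return fclose(f) ? 1 : 0;
}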

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>

Changes since 20200122:
 * Changelog material about relaxing cpuset constraints

Changes since 20210304:
 * Add Documentation/ material about relaxing cpuset constraints
---

 b/Documentation/admin-guide/sysctl/vm.rst |   12 ++++++++++++
 b/include/linux/swap.h                    |    3 ++-
 b/include/uapi/linux/mempolicy.h          |    1 +
 b/mm/vmscan.c                             |    6 ++++--
 4 files changed, 19 insertions(+), 3 deletions(-)

diff -puN Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE	2021-03-31 15:17:40.324000190 -0700
+++ b/Documentation/admin-guide/sysctl/vm.rst	2021-03-31 15:17:40.349000190 -0700
@@ -976,6 +976,7 @@ This is value OR'ed together of
 1	Zone reclaim on
 2	Zone reclaim writes dirty pages out
 4	Zone reclaim swaps pages
+8	Zone reclaim migrates pages
 =	===================================
 
 zone_reclaim_mode is disabled by default.  For file servers or workloads
@@ -1000,3 +1001,14 @@ of other processes running on other node
 Allowing regular swap effectively restricts allocations to the local
 node unless explicitly overridden by memory policies or cpuset
 configurations.
+
+Page migration during reclaim is intended for systems with tiered memory
+configurations.  These systems have multiple types of memory with varied
+performance characteristics instead of plain NUMA systems where the same
+kind of memory is found at varied distances.  Allowing page migration
+during reclaim enables these systems to migrate pages from fast tiers to
+slow tiers when the fast tier is under pressure.  This migration is
+performed before swap.  It may move data to a NUMA node that does not
+fall into the cpuset of the allocating process which might be construed
+to violate the guarantees of cpusets.  This should not be enabled on
+systems which need strict cpuset location guarantees.
diff -puN include/linux/swap.h~RECLAIM_MIGRATE include/linux/swap.h
--- a/include/linux/swap.h~RECLAIM_MIGRATE	2021-03-31 15:17:40.331000190 -0700
+++ b/include/linux/swap.h	2021-03-31 15:17:40.351000190 -0700
@@ -382,7 +382,8 @@ extern int sysctl_min_slab_ratio;
 static inline bool node_reclaim_enabled(void)
 {
 	/* Is any node_reclaim_mode bit set? */
-	return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
+	return node_reclaim_mode & (RECLAIM_ZONE |RECLAIM_WRITE|
+				    RECLAIM_UNMAP|RECLAIM_MIGRATE);
 }
 
 extern void check_move_unevictable_pages(struct pagevec *pvec);
diff -puN include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE include/uapi/linux/mempolicy.h
--- a/include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE	2021-03-31 15:17:40.337000190 -0700
+++ b/include/uapi/linux/mempolicy.h	2021-03-31 15:17:40.352000190 -0700
@@ -71,5 +71,6 @@ enum {
 #define RECLAIM_ZONE	(1<<0)	/* Run shrink_inactive_list on the zone */
 #define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
 #define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE	(1<<3)	/* Migrate to other nodes during reclaim */
 
 #endif /* _UAPI_LINUX_MEMPOLICY_H */
diff -puN mm/vmscan.c~RECLAIM_MIGRATE mm/vmscan.c
--- a/mm/vmscan.c~RECLAIM_MIGRATE	2021-03-31 15:17:40.339000190 -0700
+++ b/mm/vmscan.c	2021-03-31 15:17:40.357000190 -0700
@@ -1074,6 +1074,9 @@ static bool migrate_demote_page_ok(struc
 	VM_BUG_ON_PAGE(PageHuge(page), page);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
+	if (!(node_reclaim_mode & RECLAIM_MIGRATE))
+		return false;
+
 	/* It is pointless to do demotion in memcg reclaim */
 	if (cgroup_reclaim(sc))
 		return false;
@@ -1083,8 +1086,7 @@ static bool migrate_demote_page_ok(struc
 	if (PageTransHuge(page) && !thp_migration_supported())
 		return false;
 
-	// FIXME: actually enable this later in the series
-	return false;
+	return true;
 }
 
 /* Check if a page is dirty or under writeback */
_


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 05/10] mm/migrate: demote pages during reclaim
  2021-04-01 18:32 ` [PATCH 05/10] mm/migrate: demote pages during reclaim Dave Hansen
@ 2021-04-01 20:01   ` Yang Shi
  2021-04-01 22:58     ` Dave Hansen
  2021-04-08 10:47   ` Oscar Salvador
  2021-04-10  3:35   ` Wei Xu
  2 siblings, 1 reply; 64+ messages in thread
From: Yang Shi @ 2021-04-01 20:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux MM, Linux Kernel Mailing List, weixugc, Yang Shi,
	David Rientjes, Huang Ying, Dan Williams, Oscar Salvador

On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> This is mostly derived from a patch from Yang Shi:
>
>         https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang.shi@linux.alibaba.com/
>
> Add code to the reclaim path (shrink_page_list()) to "demote" data
> to another NUMA node instead of discarding the data.  This always
> avoids the cost of I/O needed to read the page back in and sometimes
> avoids the writeout cost when the pagee is dirty.

s/pagee/page

>
> A second pass through shrink_page_list() will be made if any demotions
> fail.  This essentally falls back to normal reclaim behavior in the

s/essentally/essentially

> case that demotions fail.  Previous versions of this patch may have
> simply failed to reclaim pages which were eligible for demotion but
> were unable to be demoted in practice.
>
> Note: This just adds the start of infratructure for migration. It is

s/infratructure/infrastructure

> actually disabled next to the FIXME in migrate_demote_page_ok().
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: osalvador <osalvador@suse.de>
>
> --
> changes from 20210122:
>  * move from GFP_HIGHUSER -> GFP_HIGHUSER_MOVABLE (Ying)
>
> changes from 202010:
>  * add MR_NUMA_MISPLACED to trace MIGRATE_REASON define
>  * make migrate_demote_page_ok() static, remove 'sc' arg until
>    later patch
>  * remove unnecessary alloc_demote_page() hugetlb warning
>  * Simplify alloc_demote_page() gfp mask.  Depend on
>    __GFP_NORETRY to make it lightweight instead of fancier
>    stuff like leaving out __GFP_IO/FS.
>  * Allocate migration page with alloc_migration_target()
>    instead of allocating directly.
> changes from 20200730:
>  * Add another pass through shrink_page_list() when demotion
>    fails.
> changes from 20210302:
>  * Use __GFP_THISNODE and revise the comment explaining the
>    GFP mask constructionn

Other than some typos above this patch looks good to me. Reviewed-by:
Yang Shi <shy828301@gmail.com>

Another nit below:

> ---
>
>  b/include/linux/migrate.h        |    9 ++++
>  b/include/trace/events/migrate.h |    3 -
>  b/mm/vmscan.c                    |   82 +++++++++++++++++++++++++++++++++++++++
>  3 files changed, 93 insertions(+), 1 deletion(-)
>
> diff -puN include/linux/migrate.h~demote-with-migrate_pages include/linux/migrate.h
> --- a/include/linux/migrate.h~demote-with-migrate_pages 2021-03-31 15:17:15.842000251 -0700
> +++ b/include/linux/migrate.h   2021-03-31 15:17:15.853000251 -0700
> @@ -27,6 +27,7 @@ enum migrate_reason {
>         MR_MEMPOLICY_MBIND,
>         MR_NUMA_MISPLACED,
>         MR_CONTIG_RANGE,
> +       MR_DEMOTION,
>         MR_TYPES
>  };
>
> @@ -196,6 +197,14 @@ struct migrate_vma {
>  int migrate_vma_setup(struct migrate_vma *args);
>  void migrate_vma_pages(struct migrate_vma *migrate);
>  void migrate_vma_finalize(struct migrate_vma *migrate);
> +int next_demotion_node(int node);
> +
> +#else /* CONFIG_MIGRATION disabled: */
> +
> +static inline int next_demotion_node(int node)
> +{
> +       return NUMA_NO_NODE;
> +}
>
>  #endif /* CONFIG_MIGRATION */
>
> diff -puN include/trace/events/migrate.h~demote-with-migrate_pages include/trace/events/migrate.h
> --- a/include/trace/events/migrate.h~demote-with-migrate_pages  2021-03-31 15:17:15.846000251 -0700
> +++ b/include/trace/events/migrate.h    2021-03-31 15:17:15.853000251 -0700
> @@ -20,7 +20,8 @@
>         EM( MR_SYSCALL,         "syscall_or_cpuset")            \
>         EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")              \
>         EM( MR_NUMA_MISPLACED,  "numa_misplaced")               \
> -       EMe(MR_CONTIG_RANGE,    "contig_range")
> +       EM( MR_CONTIG_RANGE,    "contig_range")                 \
> +       EMe(MR_DEMOTION,        "demotion")
>
>  /*
>   * First define the enums in the above macros to be exported to userspace
> diff -puN mm/vmscan.c~demote-with-migrate_pages mm/vmscan.c
> --- a/mm/vmscan.c~demote-with-migrate_pages     2021-03-31 15:17:15.848000251 -0700
> +++ b/mm/vmscan.c       2021-03-31 15:17:15.856000251 -0700
> @@ -41,6 +41,7 @@
>  #include <linux/kthread.h>
>  #include <linux/freezer.h>
>  #include <linux/memcontrol.h>
> +#include <linux/migrate.h>
>  #include <linux/delayacct.h>
>  #include <linux/sysctl.h>
>  #include <linux/oom.h>
> @@ -1035,6 +1036,23 @@ static enum page_references page_check_r
>         return PAGEREF_RECLAIM;
>  }
>
> +static bool migrate_demote_page_ok(struct page *page)
> +{
> +       int next_nid = next_demotion_node(page_to_nid(page));
> +
> +       VM_BUG_ON_PAGE(!PageLocked(page), page);
> +       VM_BUG_ON_PAGE(PageHuge(page), page);
> +       VM_BUG_ON_PAGE(PageLRU(page), page);
> +
> +       if (next_nid == NUMA_NO_NODE)
> +               return false;
> +       if (PageTransHuge(page) && !thp_migration_supported())
> +               return false;
> +
> +       // FIXME: actually enable this later in the series
> +       return false;
> +}
> +
>  /* Check if a page is dirty or under writeback */
>  static void page_check_dirty_writeback(struct page *page,
>                                        bool *dirty, bool *writeback)
> @@ -1065,6 +1083,46 @@ static void page_check_dirty_writeback(s
>                 mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
>  }
>
> +static struct page *alloc_demote_page(struct page *page, unsigned long node)
> +{
> +       struct migration_target_control mtc = {
> +               /*
> +                * Allocate from 'node', or fail the quickly and quietly.
> +                * When this happens, 'page; will likely just be discarded

s/'page;/'page'

> +                * instead of migrated.
> +                */
> +               .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_NORETRY |
> +                           __GFP_THISNODE       | __GFP_NOWARN,
> +               .nid = node
> +       };
> +
> +       return alloc_migration_target(page, (unsigned long)&mtc);
> +}
> +
> +/*
> + * Take pages on @demote_list and attempt to demote them to
> + * another node.  Pages which are not demoted are left on
> + * @demote_pages.
> + */
> +static unsigned int demote_page_list(struct list_head *demote_pages,
> +                                    struct pglist_data *pgdat,
> +                                    struct scan_control *sc)
> +{
> +       int target_nid = next_demotion_node(pgdat->node_id);
> +       unsigned int nr_succeeded = 0;
> +       int err;
> +
> +       if (list_empty(demote_pages))
> +               return 0;
> +
> +       /* Demotion ignores all cpuset and mempolicy settings */
> +       err = migrate_pages(demote_pages, alloc_demote_page, NULL,
> +                           target_nid, MIGRATE_ASYNC, MR_DEMOTION,
> +                           &nr_succeeded);
> +
> +       return nr_succeeded;
> +}
> +
>  /*
>   * shrink_page_list() returns the number of reclaimed pages
>   */
> @@ -1076,12 +1134,15 @@ static unsigned int shrink_page_list(str
>  {
>         LIST_HEAD(ret_pages);
>         LIST_HEAD(free_pages);
> +       LIST_HEAD(demote_pages);
>         unsigned int nr_reclaimed = 0;
>         unsigned int pgactivate = 0;
> +       bool do_demote_pass = true;
>
>         memset(stat, 0, sizeof(*stat));
>         cond_resched();
>
> +retry:
>         while (!list_empty(page_list)) {
>                 struct address_space *mapping;
>                 struct page *page;
> @@ -1231,6 +1292,16 @@ static unsigned int shrink_page_list(str
>                 }
>
>                 /*
> +                * Before reclaiming the page, try to relocate
> +                * its contents to another node.
> +                */
> +               if (do_demote_pass && migrate_demote_page_ok(page)) {
> +                       list_add(&page->lru, &demote_pages);
> +                       unlock_page(page);
> +                       continue;
> +               }
> +
> +               /*
>                  * Anonymous process memory has backing store?
>                  * Try to allocate it some swap space here.
>                  * Lazyfree page could be freed directly
> @@ -1480,6 +1551,17 @@ keep:
>                 list_add(&page->lru, &ret_pages);
>                 VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
>         }
> +       /* 'page_list' is always empty here */
> +
> +       /* Migrate pages selected for demotion */
> +       nr_reclaimed += demote_page_list(&demote_pages, pgdat, sc);
> +       /* Pages that could not be demoted are still in @demote_pages */
> +       if (!list_empty(&demote_pages)) {
> +               /* Pages which failed to demoted go back on @page_list for retry: */
> +               list_splice_init(&demote_pages, page_list);
> +               do_demote_pass = false;
> +               goto retry;
> +       }
>
>         pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
>
> _
>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration
  2021-04-01 18:32 ` [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration Dave Hansen
@ 2021-04-01 20:06   ` Yang Shi
  2021-04-10  4:10   ` Wei Xu
  1 sibling, 0 replies; 64+ messages in thread
From: Yang Shi @ 2021-04-01 20:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux MM, Linux Kernel Mailing List, weixugc, Yang Shi,
	David Rientjes, Huang Ying, Dan Williams, David Hildenbrand,
	Oscar Salvador

On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Some method is obviously needed to enable reclaim-based migration.
>
> Just like traditional autonuma, there will be some workloads that
> will benefit like workloads with more "static" configurations where
> hot pages stay hot and cold pages stay cold.  If pages come and go
> from the hot and cold sets, the benefits of this approach will be
> more limited.
>
> The benefits are truly workload-based and *not* hardware-based.
> We do not believe that there is a viable threshold where certain
> hardware configurations should have this mechanism enabled while
> others do not.
>
> To be conservative, earlier work defaulted to disable reclaim-
> based migration and did not include a mechanism to enable it.
> This proposes extending the existing "zone_reclaim_mode" (now
> now really node_reclaim_mode) as a method to enable it.
>
> We are open to any alternative that allows end users to enable
> this mechanism or disable it it workload harm is detected (just
> like traditional autonuma).
>
> Once this is enabled page demotion may move data to a NUMA node
> that does not fall into the cpuset of the allocating process.
> This could be construed to violate the guarantees of cpusets.
> However, since this is an opt-in mechanism, the assumption is
> that anyone enabling it is content to relax the guarantees.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>
>
> Changes since 20200122:
>  * Changelog material about relaxing cpuset constraints
>
> Changes since 20210304:
>  * Add Documentation/ material about relaxing cpuset constraints

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>
>  b/Documentation/admin-guide/sysctl/vm.rst |   12 ++++++++++++
>  b/include/linux/swap.h                    |    3 ++-
>  b/include/uapi/linux/mempolicy.h          |    1 +
>  b/mm/vmscan.c                             |    6 ++++--
>  4 files changed, 19 insertions(+), 3 deletions(-)
>
> diff -puN Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE Documentation/admin-guide/sysctl/vm.rst
> --- a/Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE   2021-03-31 15:17:40.324000190 -0700
> +++ b/Documentation/admin-guide/sysctl/vm.rst   2021-03-31 15:17:40.349000190 -0700
> @@ -976,6 +976,7 @@ This is value OR'ed together of
>  1      Zone reclaim on
>  2      Zone reclaim writes dirty pages out
>  4      Zone reclaim swaps pages
> +8      Zone reclaim migrates pages
>  =      ===================================
>
>  zone_reclaim_mode is disabled by default.  For file servers or workloads
> @@ -1000,3 +1001,14 @@ of other processes running on other node
>  Allowing regular swap effectively restricts allocations to the local
>  node unless explicitly overridden by memory policies or cpuset
>  configurations.
> +
> +Page migration during reclaim is intended for systems with tiered memory
> +configurations.  These systems have multiple types of memory with varied
> +performance characteristics instead of plain NUMA systems where the same
> +kind of memory is found at varied distances.  Allowing page migration
> +during reclaim enables these systems to migrate pages from fast tiers to
> +slow tiers when the fast tier is under pressure.  This migration is
> +performed before swap.  It may move data to a NUMA node that does not
> +fall into the cpuset of the allocating process which might be construed
> +to violate the guarantees of cpusets.  This should not be enabled on
> +systems which need strict cpuset location guarantees.
> diff -puN include/linux/swap.h~RECLAIM_MIGRATE include/linux/swap.h
> --- a/include/linux/swap.h~RECLAIM_MIGRATE      2021-03-31 15:17:40.331000190 -0700
> +++ b/include/linux/swap.h      2021-03-31 15:17:40.351000190 -0700
> @@ -382,7 +382,8 @@ extern int sysctl_min_slab_ratio;
>  static inline bool node_reclaim_enabled(void)
>  {
>         /* Is any node_reclaim_mode bit set? */
> -       return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
> +       return node_reclaim_mode & (RECLAIM_ZONE |RECLAIM_WRITE|
> +                                   RECLAIM_UNMAP|RECLAIM_MIGRATE);
>  }
>
>  extern void check_move_unevictable_pages(struct pagevec *pvec);
> diff -puN include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE include/uapi/linux/mempolicy.h
> --- a/include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE    2021-03-31 15:17:40.337000190 -0700
> +++ b/include/uapi/linux/mempolicy.h    2021-03-31 15:17:40.352000190 -0700
> @@ -71,5 +71,6 @@ enum {
>  #define RECLAIM_ZONE   (1<<0)  /* Run shrink_inactive_list on the zone */
>  #define RECLAIM_WRITE  (1<<1)  /* Writeout pages during reclaim */
>  #define RECLAIM_UNMAP  (1<<2)  /* Unmap pages during reclaim */
> +#define RECLAIM_MIGRATE        (1<<3)  /* Migrate to other nodes during reclaim */
>
>  #endif /* _UAPI_LINUX_MEMPOLICY_H */
> diff -puN mm/vmscan.c~RECLAIM_MIGRATE mm/vmscan.c
> --- a/mm/vmscan.c~RECLAIM_MIGRATE       2021-03-31 15:17:40.339000190 -0700
> +++ b/mm/vmscan.c       2021-03-31 15:17:40.357000190 -0700
> @@ -1074,6 +1074,9 @@ static bool migrate_demote_page_ok(struc
>         VM_BUG_ON_PAGE(PageHuge(page), page);
>         VM_BUG_ON_PAGE(PageLRU(page), page);
>
> +       if (!(node_reclaim_mode & RECLAIM_MIGRATE))
> +               return false;
> +
>         /* It is pointless to do demotion in memcg reclaim */
>         if (cgroup_reclaim(sc))
>                 return false;
> @@ -1083,8 +1086,7 @@ static bool migrate_demote_page_ok(struc
>         if (PageTransHuge(page) && !thp_migration_supported())
>                 return false;
>
> -       // FIXME: actually enable this later in the series
> -       return false;
> +       return true;
>  }
>
>  /* Check if a page is dirty or under writeback */
> _
>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-01 18:32 ` [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded Dave Hansen
@ 2021-04-01 22:35   ` Wei Xu
  2021-04-01 23:21     ` Dave Hansen
  2021-04-01 22:39   ` Wei Xu
  2021-04-08 10:14   ` Oscar Salvador
  2 siblings, 1 reply; 64+ messages in thread
From: Wei Xu @ 2021-04-01 22:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, yang.shi, shy828301, ying.huang,
	Dan Williams, david, osalvador

A small suggestion: Given that migrate_pages() requires that
*nr_succeeded should be initialized to 0 when it is called due to its
use of *nr_succeeded in count_vm_events() and trace_mm_migrate_pages(),
it would be less error-prone if migrate_pages() initializes
*nr_succeeded itself.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-01 18:32 ` [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded Dave Hansen
  2021-04-01 22:35   ` Wei Xu
@ 2021-04-01 22:39   ` Wei Xu
  2021-04-08 10:14   ` Oscar Salvador
  2 siblings, 0 replies; 64+ messages in thread
From: Wei Xu @ 2021-04-01 22:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, yang.shi, shy828301, ying.huang,
	Dan Williams, david, osalvador

A small suggestion: Given that migrate_pages() requires that
*nr_succeeded should be initialized to 0 when it is called due to its
use of *nr_succeeded in count_vm_events() and trace_mm_migrate_pages(),
it would be less error-prone if migrate_pages() initializes
*nr_succeeded itself.
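
For illustration, the suggestion amounts to something like the
following inside migrate_pages(); this is only a sketch, not a posted
patch:

int migrate_pages(struct list_head *from, new_page_t get_new_page,
		free_page_t put_new_page, unsigned long private,
		enum migrate_mode mode, int reason, unsigned int *nr_succeeded)
{
	/*
	 * Zero the caller-provided counter here rather than trusting
	 * every caller to pass in a zero-initialized value.
	 */
	*nr_succeeded = 0;

	/* ... rest of the function as in the patch below ... */
}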

On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>
> From: Yang Shi <yang.shi@linux.alibaba.com>
>
> The migrate_pages() returns the number of pages that were not migrated,
> or an error code.  When returning an error code, there is no way to know
> how many pages were migrated or not migrated.
>
> In the following patch, migrate_pages() is used to demote pages to PMEM
> node, we need account how many pages are reclaimed (demoted) since page
> reclaim behavior depends on this.  Add *nr_succeeded parameter to make
> migrate_pages() return how many pages are demoted successfully for all
> cases.
>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>
>
> --
>
> Note: Yang Shi originally wrote the patch, thus the SoB.  There was
> also a Reviewed-by provided since there were some modifications made
> to this after the original work.
>
> Changes since 20210302:
>  * Fix definition of CONFIG_MIGRATION=n stub migrate_pages().  Its
>    parameters were wrong, but oddly enough did not generate any
>    compile errors.
>
> Changes since 20200122:
>  * Fix migrate_pages() to manipulate nr_succeeded *value*
>    rather than the pointer.
> ---
>
>  b/include/linux/migrate.h |    5 +++--
>  b/mm/compaction.c         |    3 ++-
>  b/mm/gup.c                |    3 ++-
>  b/mm/memory-failure.c     |    4 +++-
>  b/mm/memory_hotplug.c     |    4 +++-
>  b/mm/mempolicy.c          |    8 ++++++--
>  b/mm/migrate.c            |   19 +++++++++++--------
>  b/mm/page_alloc.c         |    9 ++++++---
>  8 files changed, 36 insertions(+), 19 deletions(-)
>
> diff -puN include/linux/migrate.h~migrate_pages-add-success-return include/linux/migrate.h
> --- a/include/linux/migrate.h~migrate_pages-add-success-return  2021-03-31 15:17:14.144000255 -0700
> +++ b/include/linux/migrate.h   2021-03-31 15:17:14.182000255 -0700
> @@ -40,7 +40,8 @@ extern int migrate_page(struct address_s
>                         struct page *newpage, struct page *page,
>                         enum migrate_mode mode);
>  extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
> -               unsigned long private, enum migrate_mode mode, int reason);
> +               unsigned long private, enum migrate_mode mode, int reason,
> +               unsigned int *nr_succeeded);
>  extern struct page *alloc_migration_target(struct page *page, unsigned long private);
>  extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
>  extern void putback_movable_page(struct page *page);
> @@ -58,7 +59,7 @@ extern int migrate_page_move_mapping(str
>  static inline void putback_movable_pages(struct list_head *l) {}
>  static inline int migrate_pages(struct list_head *l, new_page_t new,
>                 free_page_t free, unsigned long private, enum migrate_mode mode,
> -               int reason)
> +               int reason, unsigned int *nr_succeeded)
>         { return -ENOSYS; }
>  static inline struct page *alloc_migration_target(struct page *page,
>                 unsigned long private)
> diff -puN mm/compaction.c~migrate_pages-add-success-return mm/compaction.c
> --- a/mm/compaction.c~migrate_pages-add-success-return  2021-03-31 15:17:14.146000255 -0700
> +++ b/mm/compaction.c   2021-03-31 15:17:14.186000255 -0700
> @@ -2247,6 +2247,7 @@ compact_zone(struct compact_control *cc,
>         unsigned long last_migrated_pfn;
>         const bool sync = cc->mode != MIGRATE_ASYNC;
>         bool update_cached;
> +       unsigned int nr_succeeded = 0;
>
>         /*
>          * These counters track activities during zone compaction.  Initialize
> @@ -2364,7 +2365,7 @@ compact_zone(struct compact_control *cc,
>
>                 err = migrate_pages(&cc->migratepages, compaction_alloc,
>                                 compaction_free, (unsigned long)cc, cc->mode,
> -                               MR_COMPACTION);
> +                               MR_COMPACTION, &nr_succeeded);
>
>                 trace_mm_compaction_migratepages(cc->nr_migratepages, err,
>                                                         &cc->migratepages);
> diff -puN mm/gup.c~migrate_pages-add-success-return mm/gup.c
> --- a/mm/gup.c~migrate_pages-add-success-return 2021-03-31 15:17:14.150000255 -0700
> +++ b/mm/gup.c  2021-03-31 15:17:14.190000255 -0700
> @@ -1550,6 +1550,7 @@ static long check_and_migrate_cma_pages(
>         unsigned long i;
>         unsigned long step;
>         bool drain_allow = true;
> +       unsigned int nr_succeeded = 0;
>         bool migrate_allow = true;
>         LIST_HEAD(cma_page_list);
>         long ret = nr_pages;
> @@ -1606,7 +1607,7 @@ check_again:
>                                 put_page(pages[i]);
>
>                 if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
> -                       (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
> +                       (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE, &nr_succeeded)) {
>                         /*
>                          * some of the pages failed migration. Do get_user_pages
>                          * without migration.
> diff -puN mm/memory-failure.c~migrate_pages-add-success-return mm/memory-failure.c
> --- a/mm/memory-failure.c~migrate_pages-add-success-return      2021-03-31 15:17:14.155000255 -0700
> +++ b/mm/memory-failure.c       2021-03-31 15:17:14.194000255 -0700
> @@ -1809,6 +1809,7 @@ static int __soft_offline_page(struct pa
>         unsigned long pfn = page_to_pfn(page);
>         struct page *hpage = compound_head(page);
>         char const *msg_page[] = {"page", "hugepage"};
> +       unsigned int nr_succeeded = 0;
>         bool huge = PageHuge(page);
>         LIST_HEAD(pagelist);
>         struct migration_target_control mtc = {
> @@ -1852,7 +1853,8 @@ static int __soft_offline_page(struct pa
>
>         if (isolate_page(hpage, &pagelist)) {
>                 ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
> -                       (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE);
> +                       (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE,
> +                       &nr_succeeded);
>                 if (!ret) {
>                         bool release = !huge;
>
> diff -puN mm/memory_hotplug.c~migrate_pages-add-success-return mm/memory_hotplug.c
> --- a/mm/memory_hotplug.c~migrate_pages-add-success-return      2021-03-31 15:17:14.160000255 -0700
> +++ b/mm/memory_hotplug.c       2021-03-31 15:17:14.197000255 -0700
> @@ -1392,6 +1392,7 @@ do_migrate_range(unsigned long start_pfn
>         unsigned long pfn;
>         struct page *page, *head;
>         int ret = 0;
> +       unsigned int nr_succeeded = 0;
>         LIST_HEAD(source);
>
>         for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> @@ -1466,7 +1467,8 @@ do_migrate_range(unsigned long start_pfn
>                 if (nodes_empty(nmask))
>                         node_set(mtc.nid, nmask);
>                 ret = migrate_pages(&source, alloc_migration_target, NULL,
> -                       (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
> +                       (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG,
> +                       &nr_succeeded);
>                 if (ret) {
>                         list_for_each_entry(page, &source, lru) {
>                                 pr_warn("migrating pfn %lx failed ret:%d ",
> diff -puN mm/mempolicy.c~migrate_pages-add-success-return mm/mempolicy.c
> --- a/mm/mempolicy.c~migrate_pages-add-success-return   2021-03-31 15:17:14.163000255 -0700
> +++ b/mm/mempolicy.c    2021-03-31 15:17:14.203000255 -0700
> @@ -1081,6 +1081,7 @@ static int migrate_page_add(struct page
>  static int migrate_to_node(struct mm_struct *mm, int source, int dest,
>                            int flags)
>  {
> +       unsigned int nr_succeeded = 0;
>         nodemask_t nmask;
>         LIST_HEAD(pagelist);
>         int err = 0;
> @@ -1103,7 +1104,8 @@ static int migrate_to_node(struct mm_str
>
>         if (!list_empty(&pagelist)) {
>                 err = migrate_pages(&pagelist, alloc_migration_target, NULL,
> -                               (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
> +                               (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL,
> +                               &nr_succeeded);
>                 if (err)
>                         putback_movable_pages(&pagelist);
>         }
> @@ -1278,6 +1280,7 @@ static long do_mbind(unsigned long start
>                      nodemask_t *nmask, unsigned long flags)
>  {
>         struct mm_struct *mm = current->mm;
> +       unsigned int nr_succeeded = 0;
>         struct mempolicy *new;
>         unsigned long end;
>         int err;
> @@ -1355,7 +1358,8 @@ static long do_mbind(unsigned long start
>                 if (!list_empty(&pagelist)) {
>                         WARN_ON_ONCE(flags & MPOL_MF_LAZY);
>                         nr_failed = migrate_pages(&pagelist, new_page, NULL,
> -                               start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
> +                               start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND,
> +                               &nr_succeeded);
>                         if (nr_failed)
>                                 putback_movable_pages(&pagelist);
>                 }
> diff -puN mm/migrate.c~migrate_pages-add-success-return mm/migrate.c
> --- a/mm/migrate.c~migrate_pages-add-success-return     2021-03-31 15:17:14.168000255 -0700
> +++ b/mm/migrate.c      2021-03-31 15:17:14.207000255 -0700
> @@ -1493,6 +1493,7 @@ static inline int try_split_thp(struct p
>   * @mode:              The migration mode that specifies the constraints for
>   *                     page migration, if any.
>   * @reason:            The reason for page migration.
> + * @nr_succeeded:      The number of pages migrated successfully.
>   *
>   * The function returns after 10 attempts or if no pages are movable any more
>   * because the list has become empty or no retryable pages exist any more.
> @@ -1503,12 +1504,11 @@ static inline int try_split_thp(struct p
>   */
>  int migrate_pages(struct list_head *from, new_page_t get_new_page,
>                 free_page_t put_new_page, unsigned long private,
> -               enum migrate_mode mode, int reason)
> +               enum migrate_mode mode, int reason, unsigned int *nr_succeeded)
>  {
>         int retry = 1;
>         int thp_retry = 1;
>         int nr_failed = 0;
> -       int nr_succeeded = 0;
>         int nr_thp_succeeded = 0;
>         int nr_thp_failed = 0;
>         int nr_thp_split = 0;
> @@ -1611,10 +1611,10 @@ retry:
>                         case MIGRATEPAGE_SUCCESS:
>                                 if (is_thp) {
>                                         nr_thp_succeeded++;
> -                                       nr_succeeded += nr_subpages;
> +                                       *nr_succeeded += nr_subpages;
>                                         break;
>                                 }
> -                               nr_succeeded++;
> +                               (*nr_succeeded)++;
>                                 break;
>                         default:
>                                 /*
> @@ -1643,12 +1643,12 @@ out:
>          */
>         list_splice(&ret_pages, from);
>
> -       count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
> +       count_vm_events(PGMIGRATE_SUCCESS, *nr_succeeded);
>         count_vm_events(PGMIGRATE_FAIL, nr_failed);
>         count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded);
>         count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed);
>         count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split);
> -       trace_mm_migrate_pages(nr_succeeded, nr_failed, nr_thp_succeeded,
> +       trace_mm_migrate_pages(*nr_succeeded, nr_failed, nr_thp_succeeded,
>                                nr_thp_failed, nr_thp_split, mode, reason);
>
>         if (!swapwrite)
> @@ -1716,6 +1716,7 @@ static int store_status(int __user *stat
>  static int do_move_pages_to_node(struct mm_struct *mm,
>                 struct list_head *pagelist, int node)
>  {
> +       unsigned int nr_succeeded = 0;
>         int err;
>         struct migration_target_control mtc = {
>                 .nid = node,
> @@ -1723,7 +1724,8 @@ static int do_move_pages_to_node(struct
>         };
>
>         err = migrate_pages(pagelist, alloc_migration_target, NULL,
> -                       (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
> +                       (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL,
> +                       &nr_succeeded);
>         if (err)
>                 putback_movable_pages(pagelist);
>         return err;
> @@ -2207,6 +2209,7 @@ int migrate_misplaced_page(struct page *
>         pg_data_t *pgdat = NODE_DATA(node);
>         int isolated;
>         int nr_remaining;
> +       unsigned int nr_succeeded = 0;
>         LIST_HEAD(migratepages);
>
>         /*
> @@ -2230,7 +2233,7 @@ int migrate_misplaced_page(struct page *
>         list_add(&page->lru, &migratepages);
>         nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
>                                      NULL, node, MIGRATE_ASYNC,
> -                                    MR_NUMA_MISPLACED);
> +                                    MR_NUMA_MISPLACED, &nr_succeeded);
>         if (nr_remaining) {
>                 if (!list_empty(&migratepages)) {
>                         list_del(&page->lru);
> diff -puN mm/page_alloc.c~migrate_pages-add-success-return mm/page_alloc.c
> --- a/mm/page_alloc.c~migrate_pages-add-success-return  2021-03-31 15:17:14.178000255 -0700
> +++ b/mm/page_alloc.c   2021-03-31 15:17:14.213000255 -0700
> @@ -8452,7 +8452,8 @@ static unsigned long pfn_max_align_up(un
>
>  /* [start, end) must belong to a single zone. */
>  static int __alloc_contig_migrate_range(struct compact_control *cc,
> -                                       unsigned long start, unsigned long end)
> +                                       unsigned long start, unsigned long end,
> +                                       unsigned int *nr_succeeded)
>  {
>         /* This function is based on compact_zone() from compaction.c. */
>         unsigned int nr_reclaimed;
> @@ -8490,7 +8491,8 @@ static int __alloc_contig_migrate_range(
>                 cc->nr_migratepages -= nr_reclaimed;
>
>                 ret = migrate_pages(&cc->migratepages, alloc_migration_target,
> -                               NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
> +                               NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE,
> +                               nr_succeeded);
>         }
>         if (ret < 0) {
>                 putback_movable_pages(&cc->migratepages);
> @@ -8526,6 +8528,7 @@ int alloc_contig_range(unsigned long sta
>         unsigned long outer_start, outer_end;
>         unsigned int order;
>         int ret = 0;
> +       unsigned int nr_succeeded = 0;
>
>         struct compact_control cc = {
>                 .nr_migratepages = 0,
> @@ -8580,7 +8583,7 @@ int alloc_contig_range(unsigned long sta
>          * allocated.  So, if we fall through be sure to clear ret so that
>          * -EBUSY is not accidentally used or returned to caller.
>          */
> -       ret = __alloc_contig_migrate_range(&cc, start, end);
> +       ret = __alloc_contig_migrate_range(&cc, start, end, &nr_succeeded);
>         if (ret && ret != -EBUSY)
>                 goto done;
>         ret =0;
> _


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 05/10] mm/migrate: demote pages during reclaim
  2021-04-01 20:01   ` Yang Shi
@ 2021-04-01 22:58     ` Dave Hansen
  0 siblings, 0 replies; 64+ messages in thread
From: Dave Hansen @ 2021-04-01 22:58 UTC (permalink / raw)
  To: Yang Shi, Dave Hansen
  Cc: Linux MM, Linux Kernel Mailing List, weixugc, Yang Shi,
	David Rientjes, Huang Ying, Dan Williams, Oscar Salvador

On 4/1/21 1:01 PM, Yang Shi wrote:
> On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>>
>>
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> This is mostly derived from a patch from Yang Shi:
>>
>>         https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang.shi@linux.alibaba.com/
>>
>> Add code to the reclaim path (shrink_page_list()) to "demote" data
>> to another NUMA node instead of discarding the data.  This always
>> avoids the cost of I/O needed to read the page back in and sometimes
>> avoids the writeout cost when the pagee is dirty.
> 
> s/pagee/page
> 
>>
>> A second pass through shrink_page_list() will be made if any demotions
>> fail.  This essentally falls back to normal reclaim behavior in the
> 
> s/essentally/essentially
> 
>> case that demotions fail.  Previous versions of this patch may have
>> simply failed to reclaim pages which were eligible for demotion but
>> were unable to be demoted in practice.
>>
>> Note: This just adds the start of infratructure for migration. It is
> 
> s/infratructure/infrastructure

Thanks for finding those!  I somehow got really sloppy in this patch.  I
scanned the rest of the descriptions and don't see any more obvious
typos/misspellings.

>> +static struct page *alloc_demote_page(struct page *page, unsigned long node)
>> +{
>> +       struct migration_target_control mtc = {
>> +               /*
>> +                * Allocate from 'node', or fail the quickly and quietly.
>> +                * When this happens, 'page; will likely just be discarded
> 
> s/'page;/'page'

Fixed.  Thanks for the review tag!



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-01 22:35   ` Wei Xu
@ 2021-04-01 23:21     ` Dave Hansen
  0 siblings, 0 replies; 64+ messages in thread
From: Dave Hansen @ 2021-04-01 23:21 UTC (permalink / raw)
  To: Wei Xu, Dave Hansen
  Cc: linux-mm, linux-kernel, yang.shi, shy828301, ying.huang,
	Dan Williams, david, osalvador

On 4/1/21 3:35 PM, Wei Xu wrote:
> A small suggestion: Given that migrate_pages() requires that
> *nr_succeeded should be initialized to 0 when it is called due to its
> use of *nr_succeeded in count_vm_events() and trace_mm_migrate_pages(),
> it would be less error-prone if migrate_pages() initializes
> *nr_succeeded itself.

That's a good point, especially if a caller made multiple
migrate_pages() calls without resetting it, which a number of callers
do.  That could really have caused some interesting problems.  Thanks
for catching that!

I'll do what you suggested.
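
Roughly, the sketch I have in mind (untested; the assignment would go
after the rest of the local declarations, of course), on top of the
hunk quoted above:

 	int nr_thp_succeeded = 0;
 	int nr_thp_failed = 0;
 	int nr_thp_split = 0;
+
+	/* No caller has to remember to zero the counter any more. */
+	*nr_succeeded = 0;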


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 09/10] mm/vmscan: never demote for memcg reclaim
  2021-04-01 18:32 ` [PATCH 09/10] mm/vmscan: never demote for memcg reclaim Dave Hansen
@ 2021-04-02  0:18   ` Wei Xu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Xu @ 2021-04-02  0:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, yang.shi, shy828301, David Rientjes,
	ying.huang, Dan Williams, david, osalvador

On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Global reclaim aims to reduce the amount of memory used on
> a given node or set of nodes.  Migrating pages to another
> node serves this purpose.
>
> memcg reclaim is different.  Its goal is to reduce the
> total memory consumption of the entire memcg, across all
> nodes.  Migration does not assist memcg reclaim because
> it just moves page contents between nodes rather than
> actually reducing memory consumption.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Suggested-by: Yang Shi <yang.shi@linux.alibaba.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>
> ---
>
>  b/mm/vmscan.c |   18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff -puN mm/vmscan.c~never-demote-for-memcg-reclaim mm/vmscan.c
> --- a/mm/vmscan.c~never-demote-for-memcg-reclaim        2021-03-31 15:17:20.476000239 -0700
> +++ b/mm/vmscan.c       2021-03-31 15:17:20.487000239 -0700
> @@ -288,7 +288,8 @@ static bool writeback_throttling_sane(st
>  #endif
>
>  static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
> -                                         int node_id)
> +                                         int node_id,
> +                                         struct scan_control *sc)

I don't see "sc" gets used in can_reclaim_anon_pages().  Remove it?

>  {
>         if (memcg == NULL) {
>                 /*
> @@ -326,7 +327,7 @@ unsigned long zone_reclaimable_pages(str
>
>         nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
>                 zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
> -       if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
> +       if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
>                 nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
>                         zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
>
> @@ -1064,7 +1065,8 @@ static enum page_references page_check_r
>         return PAGEREF_RECLAIM;
>  }
>
> -static bool migrate_demote_page_ok(struct page *page)
> +static bool migrate_demote_page_ok(struct page *page,
> +                                  struct scan_control *sc)
>  {
>         int next_nid = next_demotion_node(page_to_nid(page));
>
> @@ -1072,6 +1074,10 @@ static bool migrate_demote_page_ok(struc
>         VM_BUG_ON_PAGE(PageHuge(page), page);
>         VM_BUG_ON_PAGE(PageLRU(page), page);
>
> +       /* It is pointless to do demotion in memcg reclaim */
> +       if (cgroup_reclaim(sc))
> +               return false;
> +
>         if (next_nid == NUMA_NO_NODE)
>                 return false;
>         if (PageTransHuge(page) && !thp_migration_supported())
> @@ -1328,7 +1334,7 @@ retry:
>                  * Before reclaiming the page, try to relocate
>                  * its contents to another node.
>                  */
> -               if (do_demote_pass && migrate_demote_page_ok(page)) {
> +               if (do_demote_pass && migrate_demote_page_ok(page, sc)) {
>                         list_add(&page->lru, &demote_pages);
>                         unlock_page(page);
>                         continue;
> @@ -2362,7 +2368,7 @@ static void get_scan_count(struct lruvec
>         enum lru_list lru;
>
>         /* If we have no swap space, do not bother scanning anon pages. */
> -       if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
> +       if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
>                 scan_balance = SCAN_FILE;
>                 goto out;
>         }
> @@ -2737,7 +2743,7 @@ static inline bool should_continue_recla
>          */
>         pages_for_compaction = compact_gap(sc->order);
>         inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> -       if (can_reclaim_anon_pages(NULL, pgdat->node_id))
> +       if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
>                 inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
>
>         return inactive_lru_pages > pages_for_compaction;
> _


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 08/10] mm/vmscan: Consider anonymous pages without swap
  2021-04-01 18:32 ` [PATCH 08/10] mm/vmscan: Consider anonymous pages without swap Dave Hansen
@ 2021-04-02  0:55   ` Wei Xu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Xu @ 2021-04-02  0:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, kbusch, shy828301, David Rientjes,
	ying.huang, Dan Williams, david, osalvador

On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>
> From: Keith Busch <kbusch@kernel.org>
>
> Reclaim anonymous pages if a migration path is available now that
> demotion provides a non-swap recourse for reclaiming anon pages.
>
> Note that this check is subtly different from the
> anon_should_be_aged() checks.  This mechanism checks whether a
> specific page in a specific context *can* actually be reclaimed, given
> current swap space and cgroup limits
>
> anon_should_be_aged() is a much simpler and more preliminary check
> which just says whether there is a possibility of future reclaim.
>
> #Signed-off-by: Keith Busch <keith.busch@intel.com>
> Cc: Keith Busch <kbusch@kernel.org>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>
>
> --
>
> Changes from Dave 10/2020:
>  * remove 'total_swap_pages' modification
>
> Changes from Dave 06/2020:
>  * rename reclaim_anon_pages()->can_reclaim_anon_pages()
>
> Note: Keith's Intel SoB is commented out because he is no
> longer at Intel and his @intel.com mail will bounce.
> ---
>
>  b/mm/vmscan.c |   35 ++++++++++++++++++++++++++++++++---
>  1 file changed, 32 insertions(+), 3 deletions(-)
>
> diff -puN mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap mm/vmscan.c
> --- a/mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap  2021-03-31 15:17:19.388000242 -0700
> +++ b/mm/vmscan.c       2021-03-31 15:17:19.407000242 -0700
> @@ -287,6 +287,34 @@ static bool writeback_throttling_sane(st
>  }
>  #endif
>
> +static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
> +                                         int node_id)
> +{
> +       if (memcg == NULL) {
> +               /*
> +                * For non-memcg reclaim, is there
> +                * space in any swap device?
> +                */
> +               if (get_nr_swap_pages() > 0)
> +                       return true;
> +       } else {
> +               /* Is the memcg below its swap limit? */
> +               if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
> +                       return true;
> +       }
> +
> +       /*
> +        * The page can not be swapped.
> +        *
> +        * Can it be reclaimed from this node via demotion?
> +        */
> +       if (next_demotion_node(node_id) >= 0)
> +               return true;

When neither swap space nor RECLAIM_MIGRATE is enabled, but
next_demotion_node() is configured, inactive pages can be neither swapped
out nor demoted.  However, this check can still cause these pages to be
sent to shrink_page_list() (e.g., when can_reclaim_anon_pages() is called
by get_scan_count()) and cause THP pages to be unnecessarily split there.

One fix would be to guard this next_demotion_node() check with the
RECLAIM_MIGRATE node_reclaim_mode check.  This RECLAIM_MIGRATE
check needs to be applied to other calls to next_demotion_node() in
vmscan.c as well.
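
For instance, something along these lines (a rough, untested sketch; the
helper name here is made up, and RECLAIM_MIGRATE is the node_reclaim_mode
bit added later in this series):

static bool can_demote_anon_pages(int node_id)
{
	/* Only consider demotion when the new reclaim mode is opted in to. */
	if (!(node_reclaim_mode & RECLAIM_MIGRATE))
		return false;

	return next_demotion_node(node_id) >= 0;
}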

> +
> +       /* No way to reclaim anon pages */
> +       return false;
> +}
> +
>  /*
>   * This misses isolated pages which are not accounted for to save counters.
>   * As the data only determines if reclaim or compaction continues, it is
> @@ -298,7 +326,7 @@ unsigned long zone_reclaimable_pages(str
>
>         nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
>                 zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
> -       if (get_nr_swap_pages() > 0)
> +       if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
>                 nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
>                         zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
>
> @@ -2323,6 +2351,7 @@ enum scan_balance {
>  static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>                            unsigned long *nr)
>  {
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>         unsigned long anon_cost, file_cost, total_cost;
>         int swappiness = mem_cgroup_swappiness(memcg);
> @@ -2333,7 +2362,7 @@ static void get_scan_count(struct lruvec
>         enum lru_list lru;
>
>         /* If we have no swap space, do not bother scanning anon pages. */
> -       if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
> +       if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {

Demotion of anon pages still depends on sc->may_swap.  Any thoughts on
decoupling demotion from swapping more completely?

>                 scan_balance = SCAN_FILE;
>                 goto out;
>         }
> @@ -2708,7 +2737,7 @@ static inline bool should_continue_recla
>          */
>         pages_for_compaction = compact_gap(sc->order);
>         inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> -       if (get_nr_swap_pages() > 0)
> +       if (can_reclaim_anon_pages(NULL, pgdat->node_id))
>                 inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
>
>         return inactive_lru_pages > pages_for_compaction;
> _
>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages
  2021-04-01 18:32 ` [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages Dave Hansen
@ 2021-04-07 18:40   ` Wei Xu
  2021-04-09  8:31     ` Oscar Salvador
  0 siblings, 1 reply; 64+ messages in thread
From: Wei Xu @ 2021-04-07 18:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, shy828301, Greg Thelen, David Rientjes,
	ying.huang, Dan Williams, david, osalvador

> +/*
> + * Anonymous LRU management is a waste if there is
> + * ultimately no way to reclaim the memory.
> + */
> +bool anon_should_be_aged(struct lruvec *lruvec)
> +{
> +       struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +       /* Aging the anon LRU is valuable if swap is present: */
> +       if (total_swap_pages > 0)
> +               return true;
> +
> +       /* Also valuable if anon pages can be demoted: */
> +       if (next_demotion_node(pgdat->node_id) >= 0)
> +               return true;
> +
> +       /* No way to reclaim anon pages.  Should not age anon LRUs: */
> +       return false;
> +}

anon_should_be_aged() doesn't really need "lruvec".  It essentially
answers whether the pages of the given node can be swapped or demoted.
So it would be clearer and less confusing if anon_should_be_aged()
takes "pgdat" instead of "lruvec" as the argument.  The call to
mem_cgroup_lruvec(NULL, pgdat) in age_active_anon() can then be removed
as well.
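
I.e., something like this (untested sketch, just the quoted body with the
argument changed):

bool anon_should_be_aged(struct pglist_data *pgdat)
{
	/* Aging the anon LRU is valuable if swap is present: */
	if (total_swap_pages > 0)
		return true;

	/* Also valuable if anon pages can be demoted: */
	if (next_demotion_node(pgdat->node_id) >= 0)
		return true;

	/* No way to reclaim anon pages.  Should not age anon LRUs: */
	return false;
}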


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 01/10] mm/numa: node demotion data structure and lookup
  2021-04-01 18:32 ` [PATCH 01/10] mm/numa: node demotion data structure and lookup Dave Hansen
@ 2021-04-08  8:03   ` Oscar Salvador
  2021-04-08 21:29     ` Dave Hansen
  2021-04-09  5:32   ` Wei Xu
  1 sibling, 1 reply; 64+ messages in thread
From: Oscar Salvador @ 2021-04-08  8:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, shy828301, weixugc, rientjes, ying.huang,
	dan.j.williams, david

On Thu, Apr 01, 2021 at 11:32:18AM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Prepare for the kernel to auto-migrate pages to other memory nodes
> with a user defined node migration table. This allows creating single
> migration target for each NUMA node to enable the kernel to do NUMA
> page migrations instead of simply reclaiming colder pages. A node
> with no target is a "terminal node", so reclaim acts normally there.
> The migration target does not fundamentally _need_ to be a single node,
> but this implementation starts there to limit complexity.
> 
> If you consider the migration path as a graph, cycles (loops) in the
> graph are disallowed.  This avoids wasting resources by constantly
> migrating (A->B, B->A, A->B ...).  The expectation is that cycles will
> never be allowed.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>

I think this patch and patch#2 could be squashed

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-04-01 18:32 ` [PATCH 02/10] mm/numa: automatically generate node migration order Dave Hansen
@ 2021-04-08  8:26   ` Oscar Salvador
  2021-04-08 21:51     ` Dave Hansen
  2021-04-10  3:07   ` Wei Xu
  1 sibling, 1 reply; 64+ messages in thread
From: Oscar Salvador @ 2021-04-08  8:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, shy828301, weixugc, rientjes, ying.huang,
	dan.j.williams, david

On Thu, Apr 01, 2021 at 11:32:19AM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> When memory fills up on a node, memory contents can be
> automatically migrated to another node.  The biggest problems are
> knowing when to migrate and to where the migration should be
> targeted.
> 
> The most straightforward way to generate the "to where" list would
> be to follow the page allocator fallback lists.  Those lists
> already tell us if memory is full where to look next.  It would
> also be logical to move memory in that order.
> 
> But, the allocator fallback lists have a fatal flaw: most nodes
> appear in all the lists.  This would potentially lead to migration
> cycles (A->B, B->A, A->B, ...).
> 
> Instead of using the allocator fallback lists directly, keep a
> separate node migration ordering.  But, reuse the same data used
> to generate page allocator fallback in the first place:
> find_next_best_node().
> 
> This means that the firmware data used to populate node distances
> essentially dictates the ordering for now.  It should also be
> architecture-neutral since all NUMA architectures have a working
> find_next_best_node().
> 
> The protocol for node_demotion[] access and writing is not
> standard.  It has no specific locking and is intended to be read
> locklessly.  Readers must take care to avoid observing changes
> that appear incoherent.  This was done so that node_demotion[]

It might be just me being dense here, but that reads odd.

"Readers must take care to avoid observing changes that appear
incoherent" - I am not sure what is that supposed to mean.

I guess you mean readers of next_demotion_node()?
And if so, how do they have to take care? And what exactly does
"incoherent" mean here?

> locking has no chance of becoming a bottleneck on large systems
> with lots of CPUs in direct reclaim.
> 
> This code is unused for now.  It will be called later in the
> series.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>

...

> +static void __set_migration_target_nodes(void)
> +{
> +	nodemask_t next_pass	= NODE_MASK_NONE;
> +	nodemask_t this_pass	= NODE_MASK_NONE;
> +	nodemask_t used_targets = NODE_MASK_NONE;
> +	int node;
> +
> +	/*
> +	 * Avoid any oddities like cycles that could occur
> +	 * from changes in the topology.  This will leave
> +	 * a momentary gap when migration is disabled.
> +	 */
> +	disable_all_migrate_targets();
> +
> +	/*
> +	 * Ensure that the "disable" is visible across the system.
> +	 * Readers will see either a combination of before+disable
> +	 * state or disable+after.  They will never see before and
> +	 * after state together.
> +	 *
> +	 * The before+after state together might have cycles and
> +	 * could cause readers to do things like loop until this
> +	 * function finishes.  This ensures they can only see a
> +	 * single "bad" read and would, for instance, only loop
> +	 * once.
> +	 */
> +	smp_wmb();
> +
> +	/*
> +	 * Allocations go close to CPUs, first.  Assume that
> +	 * the migration path starts at the nodes with CPUs.
> +	 */
> +	next_pass = node_states[N_CPU];
> +again:
> +	this_pass = next_pass;
> +	next_pass = NODE_MASK_NONE;
> +	/*
> +	 * To avoid cycles in the migration "graph", ensure
> +	 * that migration sources are not future targets by
> +	 * setting them in 'used_targets'.  Do this only
> +	 * once per pass so that multiple source nodes can
> +	 * share a target node.
> +	 *
> +	 * 'used_targets' will become unavailable in future
> +	 * passes.  This limits some opportunities for
> +	 * multiple source nodes to share a destination.
> +	 */
> +	nodes_or(used_targets, used_targets, this_pass);
> +	for_each_node_mask(node, this_pass) {
> +		int target_node = establish_migrate_target(node, &used_targets);
> +
> +		if (target_node == NUMA_NO_NODE)
> +			continue;
> +
> +		/* Visit targets from this pass in the next pass: */
> +		node_set(target_node, next_pass);
> +	}
> +	/* Is another pass necessary? */
> +	if (!nodes_empty(next_pass))

When I read this I was a bit puzzled and it took me a while to figure
out how the passes are made.
I think this could benefit from a better explanation of how the passes
are performed, e.g. why next_pass should be empty before leaving.
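
For example (this is just my own reading of the loop): on a system where
nodes 0-1 have CPUs and nodes 2-3 are PMEM, the first pass starts with
this_pass = {0,1}; both get added to used_targets, node 0 picks (say)
node 2 as its target and node 1 picks node 3, so next_pass = {2,3}.  The
second pass adds {2,3} to used_targets, establish_migrate_target() then
returns NUMA_NO_NODE for both because no unused targets remain, next_pass
stays empty and the loop stops.  Spelling something like that out in a
comment or in the changelog would have saved me the head-scratching.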

Other than that looks good to me.


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events
  2021-04-01 18:32 ` [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events Dave Hansen
@ 2021-04-08  9:52   ` Oscar Salvador
  2021-04-09 10:14     ` Oscar Salvador
  0 siblings, 1 reply; 64+ messages in thread
From: Oscar Salvador @ 2021-04-08  9:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, shy828301, weixugc, rientjes, ying.huang,
	dan.j.williams, david

On Thu, Apr 01, 2021 at 11:32:21AM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Reclaim-based migration is attempting to optimize data placement in
> memory based on the system topology.  If the system changes, so must
> the migration ordering.
> 
> The implementation is conceptually simple and entirely unoptimized.
> On any memory or CPU hotplug events, assume that a node was added or
> removed and recalculate all migration targets.  This ensures that the
> node_demotion[] array is always ready to be used in case the new
> reclaim mode is enabled.
> 
> This recalculation is far from optimal, most glaringly that it does
> not even attempt to figure out the hotplug event would have some
> *actual* effect on the demotion order.  But, given the expected
> paucity of hotplug events, this should be fine.
> 
> === What does RCU provide? ===
> 
> Imaginge a simple loop which walks down the demotion path looking
> for the last node:
> 
>         terminal_node = start_node;
>         while (node_demotion[terminal_node] != NUMA_NO_NODE) {
>                 terminal_node = node_demotion[terminal_node];
>         }
> 
> The initial values are:
> 
>         node_demotion[0] = 1;
>         node_demotion[1] = NUMA_NO_NODE;
> 
> and are updated to:
> 
>         node_demotion[0] = NUMA_NO_NODE;
>         node_demotion[1] = 0;
> 
> What guarantees that the loop did not observe:
> 
>         node_demotion[0] = 1;
>         node_demotion[1] = 0;
> 
> and would loop forever?
> 
> With RCU, a rcu_read_lock/unlock() can be placed around the
> loop.  Since the write side does a synchronize_rcu(), the loop
> that observed the old contents is known to be complete after the
> synchronize_rcu() has completed.
> 
> RCU, combined with disable_all_migrate_targets(), ensures that
> the old migration state is not visible by the time
> __set_migration_target_nodes() is called.
> 
> === What does READ_ONCE() provide? ===
> 
> READ_ONCE() forbids the compiler from merging or reordering
> successive reads of node_demotion[].  This ensures that any
> updates are *eventually* observed.
> 
> Consider the above loop again.  The compiler could theoretically
> read the entirety of node_demotion[] into local storage
> (registers) and never go back to memory, and *permanently*
> observe bad values for node_demotion[].
> 
> Note: RCU does not provide any universal compiler-ordering
> guarantees:
> 
> 	https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>
> 

...
  
> +#if defined(CONFIG_MEMORY_HOTPLUG)

I am not really into PMEM, and I do not know whether we need
CONFIG_MEMORY_HOTPLUG in order to have such memory on the system.
If so, the following can be partly ignored.

I think that you either want to check CONFIG_MEMORY_HOTPLUG +
CONFIG_CPU_HOTPLUG, or just do not put it under any conf dependency.

The thing is that migrate_on_reclaim_init() will only be called if
we have CONFIG_MEMORY_HOTPLUG, and when we do not have that (but we do have
CONFIG_CPU_HOTPLUG) the calls to set_migration_target_nodes() won't be
made when the system brings up the CPUs during the boot phase,
which means node_demotion[] list won't be initialized.

But this brings me to the next point.

From a conceptual point of view, I think you want to build the
node_demotion[] list regardless of whether we support CPU or MEMORY
hotplug.

Now, in case we support CPU or MEMORY hotplug, we do want to be able to re-build
the list, e.g. in case NUMA nodes become cpu-less or memory-less.

On x86_64, CPU_HOTPLUG is enabled by default if SMP, the same for
MEMORY_HOTPLUG, but I am not sure about other archs.
Can we have !CPU_HOTPLUG && MEMORY_HOTPLUG, !MEMORY_HOTPLUG &&
CPU_HOTPLUG? I do not really know, but I think you should be careful
about that.

If this was my call, I would:

- Do not place the burden of initializing the node_demotion[] list on the
  CPU hotplug boot phase (or if so, be careful, because if I disable
  MEMORY_HOTPLUG I end up with no node_demotion[] list at all)
- Differentiate between migration_{online,offline}_cpu and
  migrate_on_reclaim_callback() and place them under their respective
  config dependency (rough sketch below).
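
Roughly, an untested sketch of that second point (the callback signatures
just follow the usual cpuhp / memory-notifier conventions, and the
registration calls are left out):

#ifdef CONFIG_HOTPLUG_CPU
static int migration_online_cpu(unsigned int cpu)
{
	set_migration_target_nodes();
	return 0;
}

static int migration_offline_cpu(unsigned int cpu)
{
	set_migration_target_nodes();
	return 0;
}
#endif /* CONFIG_HOTPLUG_CPU */

#ifdef CONFIG_MEMORY_HOTPLUG
/* Rebuild node_demotion[] on any memory add/remove event. */
static int migrate_on_reclaim_callback(struct notifier_block *self,
				       unsigned long action, void *arg)
{
	set_migration_target_nodes();
	return notifier_from_errno(0);
}
#endif /* CONFIG_MEMORY_HOTPLUG */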

But I might be missing some details so I might be off somewhere.

Another thing that caught my eye is that we are calling
set_migration_target_nodes() for every CPU the system brings up during
the boot phase.  I know systems with *lots* of CPUs.
I am not sure whether we have a mechanism to delay that until all CPUs
that are meant to be online are online (i.e. after boot)?
That's probably happening in wonderland, but I was just thinking out loud.

(Of course the same happens with memory_hotplug ACPI operations.
All it takes is some qemu handling.)

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-01 18:32 ` [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded Dave Hansen
  2021-04-01 22:35   ` Wei Xu
  2021-04-01 22:39   ` Wei Xu
@ 2021-04-08 10:14   ` Oscar Salvador
  2021-04-08 17:26     ` Yang Shi
  2 siblings, 1 reply; 64+ messages in thread
From: Oscar Salvador @ 2021-04-08 10:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, yang.shi, shy828301, weixugc, ying.huang,
	dan.j.williams, david

On Thu, Apr 01, 2021 at 11:32:23AM -0700, Dave Hansen wrote:
> 
> From: Yang Shi <yang.shi@linux.alibaba.com>
> 
> The migrate_pages() returns the number of pages that were not migrated,
> or an error code.  When returning an error code, there is no way to know
> how many pages were migrated or not migrated.
> 
> In the following patch, migrate_pages() is used to demote pages to PMEM
> node, we need account how many pages are reclaimed (demoted) since page
> reclaim behavior depends on this.  Add *nr_succeeded parameter to make
> migrate_pages() return how many pages are demoted successfully for all
> cases.
> 
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>
> 

...
>  int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  		free_page_t put_new_page, unsigned long private,
> -		enum migrate_mode mode, int reason)
> +		enum migrate_mode mode, int reason, unsigned int *nr_succeeded)
>  {
>  	int retry = 1;
>  	int thp_retry = 1;
>  	int nr_failed = 0;
> -	int nr_succeeded = 0;
>  	int nr_thp_succeeded = 0;
>  	int nr_thp_failed = 0;
>  	int nr_thp_split = 0;
> @@ -1611,10 +1611,10 @@ retry:
>  			case MIGRATEPAGE_SUCCESS:
>  				if (is_thp) {
>  					nr_thp_succeeded++;
> -					nr_succeeded += nr_subpages;
> +					*nr_succeeded += nr_subpages;
>  					break;
>  				}
> -				nr_succeeded++;
> +				(*nr_succeeded)++;
>  				break;
>  			default:
>  				/*
> @@ -1643,12 +1643,12 @@ out:
>  	 */
>  	list_splice(&ret_pages, from);
>  
> -	count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
> +	count_vm_events(PGMIGRATE_SUCCESS, *nr_succeeded);
>  	count_vm_events(PGMIGRATE_FAIL, nr_failed);
>  	count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded);
>  	count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed);
>  	count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split);
> -	trace_mm_migrate_pages(nr_succeeded, nr_failed, nr_thp_succeeded,
> +	trace_mm_migrate_pages(*nr_succeeded, nr_failed, nr_thp_succeeded,
>  			       nr_thp_failed, nr_thp_split, mode, reason);

It seems that reclaim is the only user that cares about how many pages
we could migrate; could we not do the following instead:

diff --git a/mm/migrate.c b/mm/migrate.c
index 695a594e5860..d4170b7ea2fe 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1503,7 +1503,7 @@ static inline int try_split_thp(struct page *page, struct page **page2,
  */
 int migrate_pages(struct list_head *from, new_page_t get_new_page,
                free_page_t put_new_page, unsigned long private,
-               enum migrate_mode mode, int reason)
+               enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
 {
        int retry = 1;
        int thp_retry = 1;
@@ -1654,6 +1654,9 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
        if (!swapwrite)
                current->flags &= ~PF_SWAPWRITE;

+       if (ret_succeeded)
+               *ret_succeeded = nr_succeeded;
+
        return rc;
 }

 And pass only a valid pointer from demote_page_list() and NULL from all
 the others?
 I was just wondering after seeing all those "unsigned int nr_succeeded" in all
 the other functions.
 This would also solve the "be careful to initialize nr_succeeded"
 problem?


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH 05/10] mm/migrate: demote pages during reclaim
  2021-04-01 18:32 ` [PATCH 05/10] mm/migrate: demote pages during reclaim Dave Hansen
  2021-04-01 20:01   ` Yang Shi
@ 2021-04-08 10:47   ` Oscar Salvador
  2021-04-10  3:35   ` Wei Xu
  2 siblings, 0 replies; 64+ messages in thread
From: Oscar Salvador @ 2021-04-08 10:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, weixugc, yang.shi, rientjes, ying.huang,
	dan.j.williams

On Thu, Apr 01, 2021 at 11:32:25AM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> This is mostly derived from a patch from Yang Shi:
> 
> 	https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang.shi@linux.alibaba.com/
> 
> Add code to the reclaim path (shrink_page_list()) to "demote" data
> to another NUMA node instead of discarding the data.  This always
> avoids the cost of I/O needed to read the page back in and sometimes
> avoids the writeout cost when the pagee is dirty.
> 
> A second pass through shrink_page_list() will be made if any demotions
> fail.  This essentally falls back to normal reclaim behavior in the
> case that demotions fail.  Previous versions of this patch may have
> simply failed to reclaim pages which were eligible for demotion but
> were unable to be demoted in practice.
> 
> Note: This just adds the start of infratructure for migration. It is
> actually disabled next to the FIXME in migrate_demote_page_ok().
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: osalvador <osalvador@suse.de>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-08 10:14   ` Oscar Salvador
@ 2021-04-08 17:26     ` Yang Shi
  2021-04-08 18:17       ` Oscar Salvador
  0 siblings, 1 reply; 64+ messages in thread
From: Yang Shi @ 2021-04-08 17:26 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Dave Hansen, Linux MM, Linux Kernel Mailing List, Yang Shi,
	weixugc, Huang Ying, Dan Williams, David Hildenbrand

On Thu, Apr 8, 2021 at 3:14 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Thu, Apr 01, 2021 at 11:32:23AM -0700, Dave Hansen wrote:
> >
> > From: Yang Shi <yang.shi@linux.alibaba.com>
> >
> > The migrate_pages() returns the number of pages that were not migrated,
> > or an error code.  When returning an error code, there is no way to know
> > how many pages were migrated or not migrated.
> >
> > In the following patch, migrate_pages() is used to demote pages to PMEM
> > node, we need account how many pages are reclaimed (demoted) since page
> > reclaim behavior depends on this.  Add *nr_succeeded parameter to make
> > migrate_pages() return how many pages are demoted successfully for all
> > cases.
> >
> > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Reviewed-by: Yang Shi <shy828301@gmail.com>
> > Cc: Wei Xu <weixugc@google.com>
> > Cc: Huang Ying <ying.huang@intel.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: osalvador <osalvador@suse.de>
> >
>
> ...
> >  int migrate_pages(struct list_head *from, new_page_t get_new_page,
> >               free_page_t put_new_page, unsigned long private,
> > -             enum migrate_mode mode, int reason)
> > +             enum migrate_mode mode, int reason, unsigned int *nr_succeeded)
> >  {
> >       int retry = 1;
> >       int thp_retry = 1;
> >       int nr_failed = 0;
> > -     int nr_succeeded = 0;
> >       int nr_thp_succeeded = 0;
> >       int nr_thp_failed = 0;
> >       int nr_thp_split = 0;
> > @@ -1611,10 +1611,10 @@ retry:
> >                       case MIGRATEPAGE_SUCCESS:
> >                               if (is_thp) {
> >                                       nr_thp_succeeded++;
> > -                                     nr_succeeded += nr_subpages;
> > +                                     *nr_succeeded += nr_subpages;
> >                                       break;
> >                               }
> > -                             nr_succeeded++;
> > +                             (*nr_succeeded)++;
> >                               break;
> >                       default:
> >                               /*
> > @@ -1643,12 +1643,12 @@ out:
> >        */
> >       list_splice(&ret_pages, from);
> >
> > -     count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
> > +     count_vm_events(PGMIGRATE_SUCCESS, *nr_succeeded);
> >       count_vm_events(PGMIGRATE_FAIL, nr_failed);
> >       count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded);
> >       count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed);
> >       count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split);
> > -     trace_mm_migrate_pages(nr_succeeded, nr_failed, nr_thp_succeeded,
> > +     trace_mm_migrate_pages(*nr_succeeded, nr_failed, nr_thp_succeeded,
> >                              nr_thp_failed, nr_thp_split, mode, reason);
>
> It seems that reclaim is the only user that cares about how many pages
> we could migrate; could we not do the following instead:
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 695a594e5860..d4170b7ea2fe 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1503,7 +1503,7 @@ static inline int try_split_thp(struct page *page, struct page **page2,
>   */
>  int migrate_pages(struct list_head *from, new_page_t get_new_page,
>                 free_page_t put_new_page, unsigned long private,
> -               enum migrate_mode mode, int reason)
> +               enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
>  {
>         int retry = 1;
>         int thp_retry = 1;
> @@ -1654,6 +1654,9 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>         if (!swapwrite)
>                 current->flags &= ~PF_SWAPWRITE;
>
> +       if (ret_succeeded)
> +               *ret_succeeded = nr_succeeded;
> +
>         return rc;
>  }
>
>  And pass only a valid pointer from demote_page_list() and NULL from all
>  the others?
>  I was just wondering after seeing all those "unsigned int nr_succeeded" in all
>  the other functions.
>  This would also solve the "be careful to initialize nr_succeeded"
>  problem?

Thanks, Oscar. Yes, kind of. But we have to remember to initialize
the "nr_succeeded" pointer properly for every migrate_pages() callsite,
right? And it doesn't prevent returning a wrong value if
migrate_pages() is called multiple times by one caller, although there
might be no such case (a caller that calls migrate_pages() multiple
times and cares about nr_succeeded) for now.

So IMHO I do prefer Wei's suggestion to have migrate_pages()
initialize nr_succeeded. This seems simpler.


>
>
> --
> Oscar Salvador
> SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-08 17:26     ` Yang Shi
@ 2021-04-08 18:17       ` Oscar Salvador
  2021-04-08 18:21         ` Oscar Salvador
                           ` (2 more replies)
  0 siblings, 3 replies; 64+ messages in thread
From: Oscar Salvador @ 2021-04-08 18:17 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dave Hansen, Linux MM, Linux Kernel Mailing List, Yang Shi,
	weixugc, Huang Ying, Dan Williams, David Hildenbrand

On Thu, Apr 08, 2021 at 10:26:54AM -0700, Yang Shi wrote:
 
> Thanks, Oscar. Yes, kind of. But we have to remember to initialize
> the "nr_succeeded" pointer properly for every migrate_pages() callsite,
> right? And it doesn't prevent returning a wrong value if
> migrate_pages() is called multiple times by one caller, although there
> might be no such case (a caller that calls migrate_pages() multiple
> times and cares about nr_succeeded) for now.

Hi Yang,

I might be missing something, but AFAICS you only need to initialize
the nr_succeeded pointer where it matters.
The local nr_succeeded in migrate_pages() does not go away, so it gets
initialized to 0 every time you call into it.
And if you pass a valid pointer, *ret_succeeded == nr_succeeded.

I am talking about this (not even compile-tested):

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3a389633b68f..fd661cb2ce13 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -40,7 +40,8 @@ extern int migrate_page(struct address_space *mapping,
 			struct page *newpage, struct page *page,
 			enum migrate_mode mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
-		unsigned long private, enum migrate_mode mode, int reason);
+		unsigned long private, enum migrate_mode mode, int reason,
+		unsigned int *ret_succeeded);
 extern struct page *alloc_migration_target(struct page *page, unsigned long private);
 extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 extern void putback_movable_page(struct page *page);
@@ -58,7 +59,7 @@ extern int migrate_page_move_mapping(struct address_space *mapping,
 static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t new,
 		free_page_t free, unsigned long private, enum migrate_mode mode,
-		int reason)
+		int reason, unsigned int *ret_succeeded)
 	{ return -ENOSYS; }
 static inline struct page *alloc_migration_target(struct page *page,
 		unsigned long private)
diff --git a/mm/compaction.c b/mm/compaction.c
index e04f4476e68e..7238e8faff04 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2364,7 +2364,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)

 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				compaction_free, (unsigned long)cc, cc->mode,
-				MR_COMPACTION);
+				MR_COMPACTION, NULL);

 		trace_mm_compaction_migratepages(cc->nr_migratepages, err,
 							&cc->migratepages);
diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..b70d463aa1fc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1606,7 +1606,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
 				put_page(pages[i]);

 		if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
-			(unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
+			(unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE, NULL)) {
 			/*
 			 * some of the pages failed migration. Do get_user_pages
 			 * without migration.
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..a17e0f039076 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1852,7 +1852,8 @@ static int __soft_offline_page(struct page *page)

 	if (isolate_page(hpage, &pagelist)) {
 		ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
-			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE);
+			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE,
+			NULL);
 		if (!ret) {
 			bool release = !huge;

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0cdbbfbc5757..28496376de94 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1466,7 +1466,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		if (nodes_empty(nmask))
 			node_set(mtc.nid, nmask);
 		ret = migrate_pages(&source, alloc_migration_target, NULL,
-			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
+			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG,
+			NULL);
 		if (ret) {
 			list_for_each_entry(page, &source, lru) {
 				pr_warn("migrating pfn %lx failed ret:%d ",
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ab51132547b8..df260ed12102 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1103,7 +1103,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,

 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, alloc_migration_target, NULL,
-				(unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
+				(unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL,
+				NULL);
 		if (err)
 			putback_movable_pages(&pagelist);
 	}
@@ -1355,7 +1356,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 		if (!list_empty(&pagelist)) {
 			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
 			nr_failed = migrate_pages(&pagelist, new_page, NULL,
-				start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
+				start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND,
+				NULL);
 			if (nr_failed)
 				putback_movable_pages(&pagelist);
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index 695a594e5860..087ed407b3ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1493,6 +1493,9 @@ static inline int try_split_thp(struct page *page, struct page **page2,
  * @mode:		The migration mode that specifies the constraints for
  *			page migration, if any.
  * @reason:		The reason for page migration.
+ * @ret_succeeded:	A pointer to place the value of the number of pages
+ *                      migrated successfully. The caller must pass a valid pointer
+ *                      if they care about it.
  *
  * The function returns after 10 attempts or if no pages are movable any more
  * because the list has become empty or no retryable pages exist any more.
@@ -1503,7 +1506,7 @@ static inline int try_split_thp(struct page *page, struct page **page2,
  */
 int migrate_pages(struct list_head *from, new_page_t get_new_page,
 		free_page_t put_new_page, unsigned long private,
-		enum migrate_mode mode, int reason)
+		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
 {
 	int retry = 1;
 	int thp_retry = 1;
@@ -1654,6 +1657,9 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;

+	if (ret_succeeded)
+		*ret_succeeded = nr_succeeded;
+
 	return rc;
 }

@@ -1723,7 +1729,8 @@ static int do_move_pages_to_node(struct mm_struct *mm,
 	};

 	err = migrate_pages(pagelist, alloc_migration_target, NULL,
-			(unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
+			(unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL,
+			NULL);
 	if (err)
 		putback_movable_pages(pagelist);
 	return err;
@@ -2230,7 +2237,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	list_add(&page->lru, &migratepages);
 	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
 				     NULL, node, MIGRATE_ASYNC,
-				     MR_NUMA_MISPLACED);
+				     MR_NUMA_MISPLACED, NULL);
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 46f3d594369d..0c1bbadd5ca3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8490,7 +8490,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 		cc->nr_migratepages -= nr_reclaimed;

 		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
-				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
+				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE,
+				NULL);
 	}
 	if (ret < 0) {
 		putback_movable_pages(&cc->migratepages);


As I said I might be missing a point here, but I cannot see the problem
you describe here.


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-08 18:17       ` Oscar Salvador
@ 2021-04-08 18:21         ` Oscar Salvador
  2021-04-08 20:40         ` Yang Shi
  2021-04-09 15:50         ` Dave Hansen
  2 siblings, 0 replies; 64+ messages in thread
From: Oscar Salvador @ 2021-04-08 18:21 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dave Hansen, Linux MM, Linux Kernel Mailing List, Yang Shi,
	weixugc, Huang Ying, Dan Williams, David Hildenbrand

On Thu, Apr 08, 2021 at 08:17:26PM +0200, Oscar Salvador wrote:
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 3a389633b68f..fd661cb2ce13 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -40,7 +40,8 @@ extern int migrate_page(struct address_space *mapping,
>  			struct page *newpage, struct page *page,
>  			enum migrate_mode mode);
>  extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
> -		unsigned long private, enum migrate_mode mode, int reason);
> +		unsigned long private, enum migrate_mode mode, int reason,
> +		unsigned int *ret_succeeded);
>  extern struct page *alloc_migration_target(struct page *page, unsigned long private);
>  extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
>  extern void putback_movable_page(struct page *page);
> @@ -58,7 +59,7 @@ extern int migrate_page_move_mapping(struct address_space *mapping,
>  static inline void putback_movable_pages(struct list_head *l) {}
>  static inline int migrate_pages(struct list_head *l, new_page_t new,
>  		free_page_t free, unsigned long private, enum migrate_mode mode,
> -		int reason)
> +		int reason, unsigned int *ret_succeeded)
>  	{ return -ENOSYS; }
>  static inline struct page *alloc_migration_target(struct page *page,
>  		unsigned long private)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e04f4476e68e..7238e8faff04 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2364,7 +2364,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
> 
>  		err = migrate_pages(&cc->migratepages, compaction_alloc,
>  				compaction_free, (unsigned long)cc, cc->mode,
> -				MR_COMPACTION);
> +				MR_COMPACTION, NULL);
> 
>  		trace_mm_compaction_migratepages(cc->nr_migratepages, err,
>  							&cc->migratepages);
> diff --git a/mm/gup.c b/mm/gup.c
> index e40579624f10..b70d463aa1fc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1606,7 +1606,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
>  				put_page(pages[i]);
> 
>  		if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
> -			(unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
> +			(unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE, NULL)) {
>  			/*
>  			 * some of the pages failed migration. Do get_user_pages
>  			 * without migration.
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 24210c9bd843..a17e0f039076 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1852,7 +1852,8 @@ static int __soft_offline_page(struct page *page)
> 
>  	if (isolate_page(hpage, &pagelist)) {
>  		ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
> -			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE);
> +			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE,
> +			NULL);
>  		if (!ret) {
>  			bool release = !huge;
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0cdbbfbc5757..28496376de94 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1466,7 +1466,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  		if (nodes_empty(nmask))
>  			node_set(mtc.nid, nmask);
>  		ret = migrate_pages(&source, alloc_migration_target, NULL,
> -			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
> +			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG,
> +			NULL);
>  		if (ret) {
>  			list_for_each_entry(page, &source, lru) {
>  				pr_warn("migrating pfn %lx failed ret:%d ",
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index ab51132547b8..df260ed12102 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1103,7 +1103,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
> 
>  	if (!list_empty(&pagelist)) {
>  		err = migrate_pages(&pagelist, alloc_migration_target, NULL,
> -				(unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
> +				(unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL,
> +				NULL);
>  		if (err)
>  			putback_movable_pages(&pagelist);
>  	}
> @@ -1355,7 +1356,8 @@ static long do_mbind(unsigned long start, unsigned long len,
>  		if (!list_empty(&pagelist)) {
>  			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
>  			nr_failed = migrate_pages(&pagelist, new_page, NULL,
> -				start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
> +				start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND,
> +				NULL);
>  			if (nr_failed)
>  				putback_movable_pages(&pagelist);
>  		}
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 695a594e5860..087ed407b3ce 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1493,6 +1493,9 @@ static inline int try_split_thp(struct page *page, struct page **page2,
>   * @mode:		The migration mode that specifies the constraints for
>   *			page migration, if any.
>   * @reason:		The reason for page migration.
> + * @ret_succeeded:	A pointer to place the value of the number of pages
> + *                      migrated successfully. The caller must pass a valid pointer
> + *                      if they care about it.
>   *
>   * The function returns after 10 attempts or if no pages are movable any more
>   * because the list has become empty or no retryable pages exist any more.
> @@ -1503,7 +1506,7 @@ static inline int try_split_thp(struct page *page, struct page **page2,
>   */
>  int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  		free_page_t put_new_page, unsigned long private,
> -		enum migrate_mode mode, int reason)
> +		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
>  {
>  	int retry = 1;
>  	int thp_retry = 1;
> @@ -1654,6 +1657,9 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>  	if (!swapwrite)
>  		current->flags &= ~PF_SWAPWRITE;
> 
> +	if (ret_succeeded)
> +		*ret_succeeded = nr_succeeded;
> +
>  	return rc;
>  }
> 
> @@ -1723,7 +1729,8 @@ static int do_move_pages_to_node(struct mm_struct *mm,
>  	};
> 
>  	err = migrate_pages(pagelist, alloc_migration_target, NULL,
> -			(unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
> +			(unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL,
> +			NULL);
>  	if (err)
>  		putback_movable_pages(pagelist);
>  	return err;
> @@ -2230,7 +2237,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>  	list_add(&page->lru, &migratepages);
>  	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
>  				     NULL, node, MIGRATE_ASYNC,
> -				     MR_NUMA_MISPLACED);
> +				     MR_NUMA_MISPLACED, NULL);
>  	if (nr_remaining) {
>  		if (!list_empty(&migratepages)) {
>  			list_del(&page->lru);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 46f3d594369d..0c1bbadd5ca3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8490,7 +8490,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
>  		cc->nr_migratepages -= nr_reclaimed;
> 
>  		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
> -				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
> +				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE,
> +				NULL);
>  	}
>  	if (ret < 0) {
>  		putback_movable_pages(&cc->migratepages);

Of course, to give a full context:

+static unsigned int demote_page_list(struct list_head *demote_pages,
+				     struct pglist_data *pgdat,
+				     struct scan_control *sc)
+{
+	int target_nid = next_demotion_node(pgdat->node_id);
+	unsigned int nr_succeeded = 0;
+	int err;
+
+	if (list_empty(demote_pages))
+		return 0;
+
+	/* Demotion ignores all cpuset and mempolicy settings */
+	err = migrate_pages(demote_pages, alloc_demote_page, NULL,
+			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
+			    &nr_succeeded);
+
+	return nr_succeeded;
+}

So, demote_page_list() would be the only function that passes a valid
pointer instead of NULL, because it cares about nr_succeeded.
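
For completeness, the value it returns is then meant to feed the reclaim
tally in shrink_page_list(), along these lines (a sketch from memory, not a
quote of the series):

	nr_reclaimed += demote_page_list(&demote_pages, pgdat, sc);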


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-08 18:17       ` Oscar Salvador
  2021-04-08 18:21         ` Oscar Salvador
@ 2021-04-08 20:40         ` Yang Shi
  2021-04-09  5:06           ` Oscar Salvador
  2021-04-09 15:50         ` Dave Hansen
  2 siblings, 1 reply; 64+ messages in thread
From: Yang Shi @ 2021-04-08 20:40 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Dave Hansen, Linux MM, Linux Kernel Mailing List, Yang Shi,
	weixugc, Huang Ying, Dan Williams, David Hildenbrand

On Thu, Apr 8, 2021 at 11:17 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Thu, Apr 08, 2021 at 10:26:54AM -0700, Yang Shi wrote:
>
> > Thanks, Oscar. Yes, kind of. But we have to remember to initialize the
> > "nr_succeeded" pointer properly for every migrate_pages() callsite,
> > right? And it doesn't prevent returning a wrong value if
> > migrate_pages() is called multiple times by one caller, although there
> > might be no such case (one that calls migrate_pages() multiple times and
> > cares about nr_succeeded) for now.
>
> Hi Yang,
>
> I might be missing something, but AFAICS you only need to initialize the
> nr_succeeded pointer where it matters.
> The local nr_succeeded in migrate_pages() does not go away, and so it gets
> initialized to 0 every time you call into it.
> And if you pass a valid pointer, *ret_succeeded == nr_succeeded.
>
> I am talking about this (not even compile-tested):
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 3a389633b68f..fd661cb2ce13 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -40,7 +40,8 @@ extern int migrate_page(struct address_space *mapping,
>                         struct page *newpage, struct page *page,
>                         enum migrate_mode mode);
>  extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
> -               unsigned long private, enum migrate_mode mode, int reason);
> +               unsigned long private, enum migrate_mode mode, int reason,
> +               unsigned int *ret_succeeded);
>  extern struct page *alloc_migration_target(struct page *page, unsigned long private);
>  extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
>  extern void putback_movable_page(struct page *page);
> @@ -58,7 +59,7 @@ extern int migrate_page_move_mapping(struct address_space *mapping,
>  static inline void putback_movable_pages(struct list_head *l) {}
>  static inline int migrate_pages(struct list_head *l, new_page_t new,
>                 free_page_t free, unsigned long private, enum migrate_mode mode,
> -               int reason)
> +               int reason, unsigned int *ret_succeeded)
>         { return -ENOSYS; }
>  static inline struct page *alloc_migration_target(struct page *page,
>                 unsigned long private)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e04f4476e68e..7238e8faff04 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2364,7 +2364,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>
>                 err = migrate_pages(&cc->migratepages, compaction_alloc,
>                                 compaction_free, (unsigned long)cc, cc->mode,
> -                               MR_COMPACTION);
> +                               MR_COMPACTION, NULL);
>
>                 trace_mm_compaction_migratepages(cc->nr_migratepages, err,
>                                                         &cc->migratepages);
> diff --git a/mm/gup.c b/mm/gup.c
> index e40579624f10..b70d463aa1fc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1606,7 +1606,7 @@ static long check_and_migrate_cma_pages(struct mm_struct *mm,
>                                 put_page(pages[i]);
>
>                 if (migrate_pages(&cma_page_list, alloc_migration_target, NULL,
> -                       (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
> +                       (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE, NULL)) {
>                         /*
>                          * some of the pages failed migration. Do get_user_pages
>                          * without migration.
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 24210c9bd843..a17e0f039076 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1852,7 +1852,8 @@ static int __soft_offline_page(struct page *page)
>
>         if (isolate_page(hpage, &pagelist)) {
>                 ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
> -                       (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE);
> +                       (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE,
> +                       NULL);
>                 if (!ret) {
>                         bool release = !huge;
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0cdbbfbc5757..28496376de94 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1466,7 +1466,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>                 if (nodes_empty(nmask))
>                         node_set(mtc.nid, nmask);
>                 ret = migrate_pages(&source, alloc_migration_target, NULL,
> -                       (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
> +                       (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG,
> +                       NULL);
>                 if (ret) {
>                         list_for_each_entry(page, &source, lru) {
>                                 pr_warn("migrating pfn %lx failed ret:%d ",
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index ab51132547b8..df260ed12102 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1103,7 +1103,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
>
>         if (!list_empty(&pagelist)) {
>                 err = migrate_pages(&pagelist, alloc_migration_target, NULL,
> -                               (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
> +                               (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL,
> +                               NULL);
>                 if (err)
>                         putback_movable_pages(&pagelist);
>         }
> @@ -1355,7 +1356,8 @@ static long do_mbind(unsigned long start, unsigned long len,
>                 if (!list_empty(&pagelist)) {
>                         WARN_ON_ONCE(flags & MPOL_MF_LAZY);
>                         nr_failed = migrate_pages(&pagelist, new_page, NULL,
> -                               start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
> +                               start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND,
> +                               NULL);
>                         if (nr_failed)
>                                 putback_movable_pages(&pagelist);
>                 }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 695a594e5860..087ed407b3ce 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1493,6 +1493,9 @@ static inline int try_split_thp(struct page *page, struct page **page2,
>   * @mode:              The migration mode that specifies the constraints for
>   *                     page migration, if any.
>   * @reason:            The reason for page migration.
> + * @ret_succeeded:     A pointer to place the value of the number of pages
> + *                      migrated successfully. The caller must pass a valid pointer
> + *                      if they care about it.
>   *
>   * The function returns after 10 attempts or if no pages are movable any more
>   * because the list has become empty or no retryable pages exist any more.
> @@ -1503,7 +1506,7 @@ static inline int try_split_thp(struct page *page, struct page **page2,
>   */
>  int migrate_pages(struct list_head *from, new_page_t get_new_page,
>                 free_page_t put_new_page, unsigned long private,
> -               enum migrate_mode mode, int reason)
> +               enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
>  {
>         int retry = 1;
>         int thp_retry = 1;
> @@ -1654,6 +1657,9 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
>         if (!swapwrite)
>                 current->flags &= ~PF_SWAPWRITE;
>
> +       if (ret_succeeded)
> +               *ret_succeeded = nr_succeeded;
> +

Thanks a lot for the example code. You didn't miss anything. At first
glance, I thought your suggestion seemed neater. Actually I
misunderstood what Dave said about "That could really have caused some
interesting problems." with multiple calls to migrate_pages(). I was
thinking about:

unsigned long foo()
{
    unsigned long *ret_succeeded;

    migrate_pages(..., ret_succeeded);

    migrate_pages(..., ret_succeeded);

    return *ret_succeeded;
}

Of course, this is a hypothetical case now. We don't have to care
about it at all.

>         return rc;
>  }
>
> @@ -1723,7 +1729,8 @@ static int do_move_pages_to_node(struct mm_struct *mm,
>         };
>
>         err = migrate_pages(pagelist, alloc_migration_target, NULL,
> -                       (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL);
> +                       (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL,
> +                       NULL);
>         if (err)
>                 putback_movable_pages(pagelist);
>         return err;
> @@ -2230,7 +2237,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>         list_add(&page->lru, &migratepages);
>         nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
>                                      NULL, node, MIGRATE_ASYNC,
> -                                    MR_NUMA_MISPLACED);
> +                                    MR_NUMA_MISPLACED, NULL);
>         if (nr_remaining) {
>                 if (!list_empty(&migratepages)) {
>                         list_del(&page->lru);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 46f3d594369d..0c1bbadd5ca3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8490,7 +8490,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
>                 cc->nr_migratepages -= nr_reclaimed;
>
>                 ret = migrate_pages(&cc->migratepages, alloc_migration_target,
> -                               NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
> +                               NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE,
> +                               NULL);
>         }
>         if (ret < 0) {
>                 putback_movable_pages(&cc->migratepages);
>
>
> As I said I might be missing a point here, but I cannot see the problem
> you describe here.
>
>
> --
> Oscar Salvador
> SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 01/10] mm/numa: node demotion data structure and lookup
  2021-04-08  8:03   ` Oscar Salvador
@ 2021-04-08 21:29     ` Dave Hansen
  0 siblings, 0 replies; 64+ messages in thread
From: Dave Hansen @ 2021-04-08 21:29 UTC (permalink / raw)
  To: Oscar Salvador, Dave Hansen
  Cc: linux-mm, linux-kernel, shy828301, weixugc, rientjes, ying.huang,
	dan.j.williams, david

On 4/8/21 1:03 AM, Oscar Salvador wrote:
> I think this patch and patch#2 could be squashed
> 
> Reviewed-by: Oscar Salvador <osalvador@suse.de>

Yeah, that makes a lot of sense.  I'll do that for the next version.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-04-08  8:26   ` Oscar Salvador
@ 2021-04-08 21:51     ` Dave Hansen
  2021-04-09  8:17       ` Oscar Salvador
  0 siblings, 1 reply; 64+ messages in thread
From: Dave Hansen @ 2021-04-08 21:51 UTC (permalink / raw)
  To: Oscar Salvador, Dave Hansen
  Cc: linux-mm, linux-kernel, shy828301, weixugc, rientjes, ying.huang,
	dan.j.williams, david

On 4/8/21 1:26 AM, Oscar Salvador wrote:
> On Thu, Apr 01, 2021 at 11:32:19AM -0700, Dave Hansen wrote:
>> The protocol for node_demotion[] access and writing is not
>> standard.  It has no specific locking and is intended to be read
>> locklessly.  Readers must take care to avoid observing changes
>> that appear incoherent.  This was done so that node_demotion[]
> 
> It might be just me being dense here, but that reads odd.
> 
> "Readers must take care to avoid observing changes that appear
> incoherent" - I am not sure what is that supposed to mean.
> 
> I guess you mean readers of next_demotion_node()?
> And if so, how do they have to take care? And what does "incoherent"
> mean in this context?

I've fleshed out the description a bit.  I hope this helps?

> Readers of node_demotion[] (such as next_demotion_node() callers)
> must take care to avoid observing changes that appear incoherent.
> For instance, even though no demotion cycles are allowed, it's
> possible that a cycle could be observed.
> 
> Let's say that there are three nodes, A, B and C.  node_demotion[]
> is set up to have this path:
>         
>         A -> B -> C
> 
> Suppose it was modified to instead represent this path:
> 
>         A -> C -> B
> 
> There is nothing to stop a reader from seeing B->C and then a
> moment later seeing C->B.  That *appears* to be a cycle.  This
> can be avoided with RCU and will be implemented in a later patch.

...
>> +again:
>> +	this_pass = next_pass;
>> +	next_pass = NODE_MASK_NONE;
>> +	/*
>> +	 * To avoid cycles in the migration "graph", ensure
>> +	 * that migration sources are not future targets by
>> +	 * setting them in 'used_targets'.  Do this only
>> +	 * once per pass so that multiple source nodes can
>> +	 * share a target node.
>> +	 *
>> +	 * 'used_targets' will become unavailable in future
>> +	 * passes.  This limits some opportunities for
>> +	 * multiple source nodes to share a destination.
>> +	 */
>> +	nodes_or(used_targets, used_targets, this_pass);
>> +	for_each_node_mask(node, this_pass) {
>> +		int target_node = establish_migrate_target(node, &used_targets);
>> +
>> +		if (target_node == NUMA_NO_NODE)
>> +			continue;
>> +
>> +		/* Visit targets from this pass in the next pass: */
>> +		node_set(target_node, next_pass);
>> +	}
>> +	/* Is another pass necessary? */
>> +	if (!nodes_empty(next_pass))
> 
> When I read this I was about puzzled and it took me a while to figure
> out how the passes were made.
> I think this could benefit from a better explanation on how the passes
> are being performed e.g: why next_pass should be empty before leaving.
> 
> Other than that looks good to me.
I've tried to flesh out those comments to elaborate on what is going on:

>                 /*
>                  * Visit targets from this pass in the next pass.
>                  * Eventually, every node will have been part of
>                  * a pass, and will become set in 'used_targets'.
>                  */
>                 node_set(target_node, next_pass);
>         }
>         /*
>          * 'next_pass' contains nodes which became migration
>          * targets in this pass.  Make additional passes until
>          * no more migrations targets are available.
>          */
>         if (!nodes_empty(next_pass))
>                 goto again;
> }



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-08 20:40         ` Yang Shi
@ 2021-04-09  5:06           ` Oscar Salvador
  2021-04-09  5:43             ` Wei Xu
  2021-04-09 15:43             ` Yang Shi
  0 siblings, 2 replies; 64+ messages in thread
From: Oscar Salvador @ 2021-04-09  5:06 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dave Hansen, Linux MM, Linux Kernel Mailing List, Yang Shi,
	weixugc, Huang Ying, Dan Williams, David Hildenbrand

On Thu, Apr 08, 2021 at 01:40:33PM -0700, Yang Shi wrote:
> Thanks a lot for the example code. You didn't miss anything. At first
> glance, I thought your suggestion seemed neater. Actually I
> misunderstood what Dave said about "That could really have caused some
> interesting problems." with multiple calls to migrate_pages(). I was
> thinking about:
> 
> unsigned long foo()
> {
>     unsigned long *ret_succeeded;
> 
>     migrate_pages(..., ret_succeeded);
> 
>     migrate_pages(..., ret_succeeded);
> 
>     return *ret_succeeded;
> }

But that would not be a problem either. I mean, I am not sure what
foo() is supposed to do.
I assume it is supposed to return the *total* number of pages that were
migrated?

Then could do something like:

 unsigned long foo()
 {
     unsigned long ret_succeeded;
     unsigned long total_succeeded = 0;

     migrate_pages(..., &ret_succeeded);
     total_succeeded += ret_succeeded;

     migrate_pages(..., &ret_succeeded);
     total_succeeded += ret_succeeded;

     return total_succeeded;
 }

 But AFAICS, you would have to do that with Wei Xu's version and with
 mine, no difference there.

IIUC, Dave's concern was that nr_succeeded was only set to 0 at the beginning
of the function and never reset back, which means we would carry the
sum of the previous nr_succeeded instead of the nr_succeeded of that round.
That would be misleading for e.g. reclaim in case we were to call
migrate_pages() several times, as nr_succeeded would accumulate instead of
being a delta value.

But that won't happen with either Wei Xu's version or mine.

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 01/10] mm/numa: node demotion data structure and lookup
  2021-04-01 18:32 ` [PATCH 01/10] mm/numa: node demotion data structure and lookup Dave Hansen
  2021-04-08  8:03   ` Oscar Salvador
@ 2021-04-09  5:32   ` Wei Xu
  1 sibling, 0 replies; 64+ messages in thread
From: Wei Xu @ 2021-04-09  5:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, shy828301, David Rientjes, ying.huang,
	Dan Williams, david, osalvador

It makes sense to start with a simple node tiering model like this
change, which looks good to me.

I do want to mention a likely usage scenario that motivates the need
for a list of nodes as the demotion target of a source node.

Access to a cross-socket DRAM node is still fast enough.  So to minimize
memory stranding, job processes can be allowed to fall back to
allocate pages from a remote DRAM node.

But cross-socket access to PMEM nodes (the slower tier) can be slow,
especially for random writes.  It is then desirable not to demote the
pages of a process to a remote PMEM node, even when the pages are on
a remote DRAM node, which has the remote PMEM node as its demotion
target.  At the same time, it is also desirable to still be able to
demote such pages when they become cold so that the more precious
DRAM occupied by these pages can be used for more active data.

To support such use cases, we need to be able to specify a list of
demotion target nodes for the remote DRAM node, which should include
the PMEM node closer to the process.  Certainly, we will also need an
ability to limit the demotion target nodes of a process (or a cgroup)
to ensure that only local PMEM nodes are eligible as the actual
demotion target.

Note that demoting a page to a remote PMEM node is more acceptable
than a process accesses the same remote PMEM node because demotion
is one-time, sequential access, and can also use non-temporal stores
to reduce the access overheads and bypass caches.

Reviewed-by: Wei Xu <weixugc@google.com>

On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Prepare for the kernel to auto-migrate pages to other memory nodes
> with a user defined node migration table. This allows creating single
> migration target for each NUMA node to enable the kernel to do NUMA
> page migrations instead of simply reclaiming colder pages. A node
> with no target is a "terminal node", so reclaim acts normally there.
> The migration target does not fundamentally _need_ to be a single node,
> but this implementation starts there to limit complexity.
>
> If you consider the migration path as a graph, cycles (loops) in the
> graph are disallowed.  This avoids wasting resources by constantly
> migrating (A->B, B->A, A->B ...).  The expectation is that cycles will
> never be allowed.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>
>
> --
>
> changes since 20200122:
>  * Make node_demotion[] __read_mostly
>
> changes in July 2020:
>  - Remove loop from next_demotion_node() and get_online_mems().
>    This means that the node returned by next_demotion_node()
>    might now be offline, but the worst case is that the
>    allocation fails.  That's fine since it is transient.
> ---
>
>  b/mm/migrate.c |   17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff -puN mm/migrate.c~0006-node-Define-and-export-memory-migration-path mm/migrate.c
> --- a/mm/migrate.c~0006-node-Define-and-export-memory-migration-path    2021-03-31 15:17:10.734000264 -0700
> +++ b/mm/migrate.c      2021-03-31 15:17:10.742000264 -0700
> @@ -1163,6 +1163,23 @@ out:
>         return rc;
>  }
>
> +static int node_demotion[MAX_NUMNODES] __read_mostly =
> +       {[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
> +
> +/**
> + * next_demotion_node() - Get the next node in the demotion path
> + * @node: The starting node to lookup the next node
> + *
> + * @returns: node id for next memory node in the demotion path hierarchy
> + * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
> + * @node online or guarantee that it *continues* to be the next demotion
> + * target.
> + */
> +int next_demotion_node(int node)
> +{
> +       return node_demotion[node];
> +}
> +
>  /*
>   * Obtain the lock on page, remove all ptes and migrate the page
>   * to the newly allocated page in newpage.
> _


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-09  5:06           ` Oscar Salvador
@ 2021-04-09  5:43             ` Wei Xu
  2021-04-09 15:43             ` Yang Shi
  1 sibling, 0 replies; 64+ messages in thread
From: Wei Xu @ 2021-04-09  5:43 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Yang Shi, Dave Hansen, Linux MM, Linux Kernel Mailing List,
	Yang Shi, Huang Ying, Dan Williams, David Hildenbrand

I agree that it is a good further improvement to make nr_succeeded an optional
output argument of migrate_pages(), given that most callers don't need it.  IMHO,
the most important thing in this matter is to ensure that nr_succeeded (when it
is needed) only reports the pages successfully migrated in this round and does
not accumulate.  This is addressed by both proposals.

On Thu, Apr 8, 2021 at 10:06 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Thu, Apr 08, 2021 at 01:40:33PM -0700, Yang Shi wrote:
> > Thanks a lot for the example code. You didn't miss anything. At first
> > glance, I thought your suggestion seemed neater. Actually I
> > misunderstood what Dave said about "That could really have caused some
> > interesting problems." with multiple calls to migrate_pages(). I was
> > thinking about:
> >
> > unsigned long foo()
> > {
> >     unsigned long *ret_succeeded;
> >
> >     migrate_pages(..., ret_succeeded);
> >
> >     migrate_pages(..., ret_succeeded);
> >
> >     return *ret_succeeded;
> > }
>
> But that would not be a problem as well. I mean I am not sure what is
> foo() supposed to do.
> I assume is supposed to return the *total* number of pages that were
> migrated?
>
> Then could do something like:
>
>  unsigned long foo()
>  {
>      unsigned long ret_succeeded;
>      unsigned long total_succeeded = 0;
>
>      migrate_pages(..., &ret_succeeded);
>      total_succeeded += ret_succeeded;
>
>      migrate_pages(..., &ret_succeeded);
>      total_succeeded += ret_succeeded;
>
>      return *total_succeeded;
>  }
>
>  But AFAICS, you would have to do that with Wei Xu's version and with
>  mine, no difference there.
>
> IIUC, Dave's concern was that nr_succeeded was only set to 0 at the beginning
> of the function, and never reset back, which means, we would carry the
> sum of previous nr_succeeded instead of the nr_succeeded in that round.
> That would be misleading for e.g: reclaim in case we were to call
> migrate_pages() several times, as instead of a delta value, nr_succeeded
> would accumulate.
>
> But that won't happen neither with Wei Xu's version nor with mine.
>
> --
> Oscar Salvador
> SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-04-08 21:51     ` Dave Hansen
@ 2021-04-09  8:17       ` Oscar Salvador
  0 siblings, 0 replies; 64+ messages in thread
From: Oscar Salvador @ 2021-04-09  8:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, linux-mm, linux-kernel, shy828301, weixugc,
	rientjes, ying.huang, dan.j.williams, david

On Thu, Apr 08, 2021 at 02:51:20PM -0700, Dave Hansen wrote:
> I've fleshed out the description a bit.  I hope this helps?

Yes, thanks Dave, both additions look fine to me.


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages
  2021-04-07 18:40   ` Wei Xu
@ 2021-04-09  8:31     ` Oscar Salvador
  0 siblings, 0 replies; 64+ messages in thread
From: Oscar Salvador @ 2021-04-09  8:31 UTC (permalink / raw)
  To: Wei Xu
  Cc: Dave Hansen, linux-mm, linux-kernel, shy828301, Greg Thelen,
	David Rientjes, ying.huang, Dan Williams, david

On Wed, Apr 07, 2021 at 11:40:13AM -0700, Wei Xu wrote:
> anon_should_be_aged() doesn't really need "lruvec".  It essentially
> answers whether the pages of the given node can be swapped or demoted.
> So it would be clearer and less confusing if anon_should_be_aged()
> takes "pgdat" instead of "lruvec" as the argument.  The call to
> mem_cgroup_lruvec(NULL, pgdat) in age_active_anon() can then be removed
> as well.

I tend to agree with this, and I would go one step further with the naming.
For me, taking into account the nature of the function that tells us whether
we have any means to age those pages, a better fit would be something like
anon_can_be_aged(). IIUC, the "should age" part would be inactive_is_low().
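
To illustrate the shape I have in mind (only a sketch, not code from the
series; it assumes the helper just has to answer "can pages on this node be
swapped or demoted?"):

static inline bool anon_can_be_aged(struct pglist_data *pgdat)
{
	/*
	 * Aging anon pages only makes sense if they can go somewhere:
	 * either out to swap or down to the node's demotion target.
	 */
	return total_swap_pages > 0 ||
	       next_demotion_node(pgdat->node_id) != NUMA_NO_NODE;
}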


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events
  2021-04-08  9:52   ` Oscar Salvador
@ 2021-04-09 10:14     ` Oscar Salvador
  2021-04-09 10:15       ` Oscar Salvador
  2021-04-09 18:59       ` David Hildenbrand
  0 siblings, 2 replies; 64+ messages in thread
From: Oscar Salvador @ 2021-04-09 10:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, shy828301, weixugc, rientjes, ying.huang,
	dan.j.williams, david

On Thu, Apr 08, 2021 at 11:52:51AM +0200, Oscar Salvador wrote:
> I am not really into PMEM, and I do not know whether we need
> CONFIG_MEMORY_HOTPLUG in order to have such memory on the system.
> If so, the following can be partly ignored.

Ok, I refreshed my memory with [1].
From that, it seems that in order to use PMEM as RAM we need CONFIG_MEMORY_HOTPLUG.
But is that always the case? Can it happen that in some scenario PMEM comes ready
to use and we do not need the hotplug trick?

Anyway, I would still like to clarify the state of the HOTPLUG_CPU.
On x86_64, HOTPLUG_CPU and MEMORY_HOTPLUG are tied together via SMP, but on arm64
one can have MEMORY_HOTPLUG without having picked HOTPLUG_CPU.

My point is that we might want to guard the cpu-hotplug callback functions and
their registration by their own HOTPLUG_CPU block, instead of guarding them in
the same MEMORY_HOTPLUG block, to make it clearer.
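
Something along these lines is what I mean (only a sketch, with made-up
callback/initcall names; set_migration_target_nodes() is assumed to be the
rebuild helper from this series):

#ifdef CONFIG_HOTPLUG_CPU
static int migration_cpu_online(unsigned int cpu)
{
	/* Recompute the demotion order when the CPU topology changes. */
	set_migration_target_nodes();
	return 0;
}

static int __init migrate_cpu_hotplug_init(void)
{
	int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/migrate:online",
				    migration_cpu_online, NULL);

	return ret < 0 ? ret : 0;
}
#endif /* CONFIG_HOTPLUG_CPU */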

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events
  2021-04-09 10:14     ` Oscar Salvador
@ 2021-04-09 10:15       ` Oscar Salvador
  2021-04-09 18:59       ` David Hildenbrand
  1 sibling, 0 replies; 64+ messages in thread
From: Oscar Salvador @ 2021-04-09 10:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, shy828301, weixugc, rientjes, ying.huang,
	dan.j.williams, david

On Fri, Apr 09, 2021 at 12:14:04PM +0200, Oscar Salvador wrote:
> On Thu, Apr 08, 2021 at 11:52:51AM +0200, Oscar Salvador wrote:
> > I am not really into PMEM, and I ignore whether we need
> > CONFIG_MEMORY_HOTPLUG in order to have such memory on the system.
> > If so, the following can be partly ignored.
> 
> Ok, I refreshed by memory with [1].

Bleh, [1] being https://lore.kernel.org/linux-mm/20190124231448.E102D18E@viggo.jf.intel.com/

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-09  5:06           ` Oscar Salvador
  2021-04-09  5:43             ` Wei Xu
@ 2021-04-09 15:43             ` Yang Shi
  1 sibling, 0 replies; 64+ messages in thread
From: Yang Shi @ 2021-04-09 15:43 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Dave Hansen, Linux MM, Linux Kernel Mailing List, Yang Shi,
	weixugc, Huang Ying, Dan Williams, David Hildenbrand

On Thu, Apr 8, 2021 at 10:06 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Thu, Apr 08, 2021 at 01:40:33PM -0700, Yang Shi wrote:
> > Thanks a lot for the example code. You didn't miss anything. At first
> > glance, I thought your suggestion seemed neater. Actually I
> > misunderstood what Dave said about "That could really have caused some
> > interesting problems." with multiple calls to migrate_pages(). I was
> > thinking about:
> >
> > unsigned long foo()
> > {
> >     unsigned long *ret_succeeded;
> >
> >     migrate_pages(..., ret_succeeded);
> >
> >     migrate_pages(..., ret_succeeded);
> >
> >     return *ret_succeeded;
> > }
>
> But that would not be a problem as well. I mean I am not sure what is
> foo() supposed to do.
> I assume is supposed to return the *total* number of pages that were
> migrated?
>
> Then could do something like:
>
>  unsigned long foo()
>  {
>      unsigned long ret_succeeded;
>      unsigned long total_succeeded = 0;
>
>      migrate_pages(..., &ret_succeeded);
>      total_succeeded += ret_succeeded;
>
>      migrate_pages(..., &ret_succeeded);
>      total_succeeded += ret_succeeded;
>
>      return *total_succeeded;
>  }
>
>  But AFAICS, you would have to do that with Wei Xu's version and with
>  mine, no difference there.

It is because nr_succeeded is reset for each migrate_pages() call.

You could do "*ret_succeeded += nr_succeeded" if we want an
accumulated counter, then you don't have to add total_succeeded. And
since nr_succeeded is reset for each migrate_pages() call, so both vm
counter and trace point are happy.
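
In other words, a sketch of that alternative (not what the posted patch
does) would be to accumulate in migrate_pages() and have the caller zero its
counter once up front:

	/* tail of migrate_pages(): accumulate instead of overwrite */
	if (ret_succeeded)
		*ret_succeeded += nr_succeeded;

	/* caller */
	unsigned int total_succeeded = 0;

	migrate_pages(..., &total_succeeded);
	migrate_pages(..., &total_succeeded);
	/* total_succeeded now holds the sum of both calls */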

>
> IIUC, Dave's concern was that nr_succeeded was only set to 0 at the beginning
> of the function, and never reset back, which means, we would carry the
> sum of previous nr_succeeded instead of the nr_succeeded in that round.
> That would be misleading for e.g: reclaim in case we were to call
> migrate_pages() several times, as instead of a delta value, nr_succeeded
> would accumulate.

I think the most straightforward concern is the vm counter and trace
point in migrate_pages(): if migrate_pages() is called multiple times,
we may see messed-up counters if nr_succeeded is not reset properly.
Of course, both your and Wei's suggestions solve this problem.

But if we have a use case which returns nr_succeeded and calls
migrate_pages() multiple times, I think we do want to return an
accumulated value IMHO.

>
> But that won't happen neither with Wei Xu's version nor with mine.
>
> --
> Oscar Salvador
> SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-08 18:17       ` Oscar Salvador
  2021-04-08 18:21         ` Oscar Salvador
  2021-04-08 20:40         ` Yang Shi
@ 2021-04-09 15:50         ` Dave Hansen
  2021-04-09 18:47           ` Wei Xu
  2021-04-09 20:10           ` Yang Shi
  2 siblings, 2 replies; 64+ messages in thread
From: Dave Hansen @ 2021-04-09 15:50 UTC (permalink / raw)
  To: Oscar Salvador, Yang Shi
  Cc: Dave Hansen, Linux MM, Linux Kernel Mailing List, Yang Shi,
	weixugc, Huang Ying, Dan Williams, David Hildenbrand

On 4/8/21 11:17 AM, Oscar Salvador wrote:
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8490,7 +8490,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
>  		cc->nr_migratepages -= nr_reclaimed;
> 
>  		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
> -				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
> +				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE,
> +				NULL);
>  	}
>  	if (ret < 0) {
>  		putback_movable_pages(&cc->migratepages);

I also considered passing NULL to mean "I don't care about
nr_succeeded".  I mostly avoided it to reduce churn.  But, looking at it
here, it does seem cleaner.

Any objections to moving over to Oscar's suggestion?


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-09 15:50         ` Dave Hansen
@ 2021-04-09 18:47           ` Wei Xu
  2021-04-09 20:10           ` Yang Shi
  1 sibling, 0 replies; 64+ messages in thread
From: Wei Xu @ 2021-04-09 18:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Oscar Salvador, Yang Shi, Dave Hansen, Linux MM,
	Linux Kernel Mailing List, Yang Shi, Huang Ying, Dan Williams,
	David Hildenbrand

On Fri, Apr 9, 2021 at 8:50 AM Dave Hansen <dave.hansen@intel.com> wrote:
> I also considered passing NULL to mean "I don't care about
> nr_succeeded".  I mostly avoided it to reduce churn.  But, looking at it
> here, it does seem cleaner.
>
> Any objections to moving over to Oscar's suggestion?

I like this approach (making *nr_succeeded an optional argument).  No
objection here.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events
  2021-04-09 10:14     ` Oscar Salvador
  2021-04-09 10:15       ` Oscar Salvador
@ 2021-04-09 18:59       ` David Hildenbrand
  2021-04-12  7:19         ` Oscar Salvador
  1 sibling, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2021-04-09 18:59 UTC (permalink / raw)
  To: Oscar Salvador, Dave Hansen
  Cc: linux-mm, linux-kernel, shy828301, weixugc, rientjes, ying.huang,
	dan.j.williams

On 09.04.21 12:14, Oscar Salvador wrote:
> On Thu, Apr 08, 2021 at 11:52:51AM +0200, Oscar Salvador wrote:
>> I am not really into PMEM, and I ignore whether we need
>> CONFIG_MEMORY_HOTPLUG in order to have such memory on the system.
>> If so, the following can be partly ignored.
> 
> Ok, I refreshed by memory with [1].
>  From that, it seems that in order to use PMEM as RAM we need CONFIG_MEMORY_HOTPLUG.
> But is that always the case? Can happen that in some scenario PMEM comes ready
> to use and we do not need the hotplug trick?

The only way to add more System RAM is via add_memory() and friends like 
add_memory_driver_managed(). These all require CONFIG_MEMORY_HOTPLUG.

Memory ballooning is a different case, but there we're only adjusting 
the managed page counters.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
  2021-04-09 15:50         ` Dave Hansen
  2021-04-09 18:47           ` Wei Xu
@ 2021-04-09 20:10           ` Yang Shi
  1 sibling, 0 replies; 64+ messages in thread
From: Yang Shi @ 2021-04-09 20:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Oscar Salvador, Dave Hansen, Linux MM, Linux Kernel Mailing List,
	Yang Shi, weixugc, Huang Ying, Dan Williams, David Hildenbrand

On Fri, Apr 9, 2021 at 8:50 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/8/21 11:17 AM, Oscar Salvador wrote:
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -8490,7 +8490,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
> >               cc->nr_migratepages -= nr_reclaimed;
> >
> >               ret = migrate_pages(&cc->migratepages, alloc_migration_target,
> > -                             NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
> > +                             NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE,
> > +                             NULL);
> >       }
> >       if (ret < 0) {
> >               putback_movable_pages(&cc->migratepages);
>
> I also considered passing NULL to mean "I don't care about
> nr_succeeded".  I mostly avoided it to reduce churn.  But, looking at it
> here, it does seem cleaner.
>
> Any objections to moving over to Oscar's suggestion?

No, fine to me.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-04-01 18:32 ` [PATCH 02/10] mm/numa: automatically generate node migration order Dave Hansen
  2021-04-08  8:26   ` Oscar Salvador
@ 2021-04-10  3:07   ` Wei Xu
  2021-04-14  8:08     ` Oscar Salvador
  1 sibling, 1 reply; 64+ messages in thread
From: Wei Xu @ 2021-04-10  3:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux MM, Linux Kernel Mailing List, Yang Shi, David Rientjes,
	Huang Ying, Dan Williams, David Hildenbrand, Oscar Salvador

On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> +/*
> + * node_demotion[] example:
> + *
> + * Consider a system with two sockets.  Each socket has
> + * three classes of memory attached: fast, medium and slow.
> + * Each memory class is placed in its own NUMA node.  The
> + * CPUs are placed in the node with the "fast" memory.  The
> + * 6 NUMA nodes (0-5) might be split among the sockets like
> + * this:
> + *
> + *     Socket A: 0, 1, 2
> + *     Socket B: 3, 4, 5
> + *
> + * When Node 0 fills up, its memory should be migrated to
> + * Node 1.  When Node 1 fills up, it should be migrated to
> + * Node 2.  The migration path start on the nodes with the
> + * processors (since allocations default to this node) and
> + * fast memory, progress through medium and end with the
> + * slow memory:
> + *
> + *     0 -> 1 -> 2 -> stop
> + *     3 -> 4 -> 5 -> stop
> + *
> + * This is represented in the node_demotion[] like this:
> + *
> + *     {  1, // Node 0 migrates to 1
> + *        2, // Node 1 migrates to 2
> + *       -1, // Node 2 does not migrate
> + *        4, // Node 3 migrates to 4
> + *        5, // Node 4 migrates to 5
> + *       -1} // Node 5 does not migrate
> + */

In this example, if we want to support multiple nodes as the demotion
target of a source node, we can group these nodes into three tiers
(classes):

fast class:
0 -> {1, 4}  // 1 is the preferred
3 -> {4, 1}  // 4 is the preferred

medium class:
1 -> {2, 5}  // 2 is the preferred
4 -> {5, 2}  // 5 is the preferred

slow class:
2 -> stop
5 -> stop

This can guarantee there are no cycles, either.  Does it sound sensible?
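
For illustration only (a hypothetical layout, not something proposed in this
series), the demotion table could become an ordered list of targets per node,
first entry preferred:

#define MAX_DEMOTION_TARGETS	2

static int node_demotion[MAX_NUMNODES][MAX_DEMOTION_TARGETS] __read_mostly = {
	[0 ... MAX_NUMNODES - 1] = { NUMA_NO_NODE, NUMA_NO_NODE },
};

/*
 * For the example above: node_demotion[0] = { 1, 4 }, node_demotion[3] = { 4, 1 },
 * node_demotion[1] = { 2, 5 }, node_demotion[4] = { 5, 2 }, and nodes 2 and 5
 * keep NUMA_NO_NODE entries (they are terminal).
 */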

> +again:
> +       this_pass = next_pass;
> +       next_pass = NODE_MASK_NONE;
> +       /*
> +        * To avoid cycles in the migration "graph", ensure
> +        * that migration sources are not future targets by
> +        * setting them in 'used_targets'.  Do this only
> +        * once per pass so that multiple source nodes can
> +        * share a target node.
> +        *
> +        * 'used_targets' will become unavailable in future
> +        * passes.  This limits some opportunities for
> +        * multiple source nodes to share a destination.
> +        */
> +       nodes_or(used_targets, used_targets, this_pass);
> +       for_each_node_mask(node, this_pass) {
> +               int target_node = establish_migrate_target(node, &used_targets);
> +
> +               if (target_node == NUMA_NO_NODE)
> +                       continue;
> +
> +               /* Visit targets from this pass in the next pass: */
> +               node_set(target_node, next_pass);
> +       }
> +       /* Is another pass necessary? */
> +       if (!nodes_empty(next_pass))
> +               goto again;

This goto seems like exactly a "do {} while" loop.  Any particular reason not to
use "do {} while" here?


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 05/10] mm/migrate: demote pages during reclaim
  2021-04-01 18:32 ` [PATCH 05/10] mm/migrate: demote pages during reclaim Dave Hansen
  2021-04-01 20:01   ` Yang Shi
  2021-04-08 10:47   ` Oscar Salvador
@ 2021-04-10  3:35   ` Wei Xu
  2 siblings, 0 replies; 64+ messages in thread
From: Wei Xu @ 2021-04-10  3:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux MM, Linux Kernel Mailing List, Yang Shi, David Rientjes,
	Huang Ying, Dan Williams, Oscar Salvador

> +static unsigned int demote_page_list(struct list_head *demote_pages,
> +                                    struct pglist_data *pgdat,
> +                                    struct scan_control *sc)

sc is not needed and can be removed from demote_page_list().
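
i.e. the helper would shrink to something like this (same body as shown
earlier in the thread, just with "sc" dropped; the unused "err" is dropped
here as well):

static unsigned int demote_page_list(struct list_head *demote_pages,
				     struct pglist_data *pgdat)
{
	int target_nid = next_demotion_node(pgdat->node_id);
	unsigned int nr_succeeded = 0;

	if (list_empty(demote_pages))
		return 0;

	/* Demotion ignores all cpuset and mempolicy settings */
	migrate_pages(demote_pages, alloc_demote_page, NULL, target_nid,
		      MIGRATE_ASYNC, MR_DEMOTION, &nr_succeeded);

	return nr_succeeded;
}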

Reviewed-by: Wei Xu <weixugc@google.com>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 06/10] mm/vmscan: add page demotion counter
  2021-04-01 18:32 ` [PATCH 06/10] mm/vmscan: add page demotion counter Dave Hansen
@ 2021-04-10  3:40   ` Wei Xu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Xu @ 2021-04-10  3:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux MM, Linux Kernel Mailing List, Yang Shi, Yang Shi,
	David Rientjes, Huang Ying, Dan Williams, David Hildenbrand,
	Oscar Salvador

On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>
> From: Yang Shi <yang.shi@linux.alibaba.com>
>
> Account the number of demoted pages into reclaim_state->nr_demoted.
>
> Add pgdemote_kswapd and pgdemote_direct VM counters showed in
> /proc/vmstat.
>
> [ daveh:
>    - __count_vm_events() a bit, and made them look at the THP
>      size directly rather than getting data from migrate_pages()
> ]
>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>

These counters can be more useful if they are per-node.
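
For instance (a hypothetical variant, not part of this series: it assumes
PGDEMOTE_KSWAPD/PGDEMOTE_DIRECT were made node stat items instead of vm
events), demote_page_list() could account against the source node:

	if (current_is_kswapd())
		mod_node_page_state(pgdat, PGDEMOTE_KSWAPD, nr_succeeded);
	else
		mod_node_page_state(pgdat, PGDEMOTE_DIRECT, nr_succeeded);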

Reviewed-by: Wei Xu <weixugc@google.com>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration
  2021-04-01 18:32 ` [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration Dave Hansen
  2021-04-01 20:06   ` Yang Shi
@ 2021-04-10  4:10   ` Wei Xu
  1 sibling, 0 replies; 64+ messages in thread
From: Wei Xu @ 2021-04-10  4:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux MM, Linux Kernel Mailing List, Yang Shi, David Rientjes,
	Huang Ying, Dan Williams, David Hildenbrand, Oscar Salvador

On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> This proposes extending the existing "zone_reclaim_mode" (now
> now really node_reclaim_mode) as a method to enable it.

Nit: now now -> now

> We are open to any alternative that allows end users to enable
> this mechanism or disable it it workload harm is detected (just
> like traditional autonuma).

Nit: it it -> it if

> diff -puN mm/vmscan.c~RECLAIM_MIGRATE mm/vmscan.c
> --- a/mm/vmscan.c~RECLAIM_MIGRATE       2021-03-31 15:17:40.339000190 -0700
> +++ b/mm/vmscan.c       2021-03-31 15:17:40.357000190 -0700
> @@ -1074,6 +1074,9 @@ static bool migrate_demote_page_ok(struc
>         VM_BUG_ON_PAGE(PageHuge(page), page);
>         VM_BUG_ON_PAGE(PageLRU(page), page);
>
> +       if (!(node_reclaim_mode & RECLAIM_MIGRATE))
> +               return false;
> +

As I commented on an earlier patch in this series, the RECLAIM_MIGRATE check
needs to be added to other new callers of next_demotion_node() as well to avoid
unnecessarily splitting THP pages when neither swap nor RECLAIM_MIGRATE
is enabled.  It can be too late to check RECLAIM_MIGRATE only in
migrate_demote_page_ok().
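
Concretely, a small gate helper (name and placement hypothetical, not from
the posted series) used by every caller that considers demotion would avoid
that:

static bool demotion_enabled(int nid)
{
	/* Don't even look up a demotion target unless the sysctl asks for it. */
	if (!(node_reclaim_mode & RECLAIM_MIGRATE))
		return false;

	return next_demotion_node(nid) != NUMA_NO_NODE;
}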


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events
  2021-04-09 18:59       ` David Hildenbrand
@ 2021-04-12  7:19         ` Oscar Salvador
  2021-04-12  9:19           ` David Hildenbrand
  0 siblings, 1 reply; 64+ messages in thread
From: Oscar Salvador @ 2021-04-12  7:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dave Hansen, linux-mm, linux-kernel, shy828301, weixugc,
	rientjes, ying.huang, dan.j.williams

On Fri, Apr 09, 2021 at 08:59:21PM +0200, David Hildenbrand wrote:
 
> The only way to add more System RAM is via add_memory() and friends like
> add_memory_driver_managed(). These all require CONFIG_MEMORY_HOTPLUG.

Yeah, my point was more about whether PMEM can come in a way where it does
not have to be hotplugged but is functional (as RAM) by default.
But after having read all the papers out there, I do not think that is possible.

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events
  2021-04-12  7:19         ` Oscar Salvador
@ 2021-04-12  9:19           ` David Hildenbrand
  0 siblings, 0 replies; 64+ messages in thread
From: David Hildenbrand @ 2021-04-12  9:19 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Dave Hansen, linux-mm, linux-kernel, shy828301, weixugc,
	rientjes, ying.huang, dan.j.williams

On 12.04.21 09:19, Oscar Salvador wrote:
> On Fri, Apr 09, 2021 at 08:59:21PM +0200, David Hildenbrand wrote:
>   
>> The only way to add more System RAM is via add_memory() and friends like
>> add_memory_driver_managed(). These all require CONFIG_MEMORY_HOTPLUG.
> 
> Yeah, my point was more towards whether PMEM can come in a way that it does
> not have to be hotplugged, but come functional by default (as RAM).
> But after having read all papers out there, I do not think that it is possible.
> 

You mean e.g., configuring in the BIOS/firmware how an NVDIMM will get 
exposed to the OS (pmem vs. RAM). I once heard something about that, not 
sure if it's real. But from Linux' perspective, it would simply be 
System RAM and it would get treated like that.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-04-10  3:07   ` Wei Xu
@ 2021-04-14  8:08     ` Oscar Salvador
  2021-04-14  8:11       ` Oscar Salvador
                         ` (2 more replies)
  0 siblings, 3 replies; 64+ messages in thread
From: Oscar Salvador @ 2021-04-14  8:08 UTC (permalink / raw)
  To: Wei Xu
  Cc: Dave Hansen, Linux MM, Linux Kernel Mailing List, Yang Shi,
	David Rientjes, Huang Ying, Dan Williams, David Hildenbrand

On Fri, Apr 09, 2021 at 08:07:08PM -0700, Wei Xu wrote:
> On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> > + * When Node 0 fills up, its memory should be migrated to
> > + * Node 1.  When Node 1 fills up, it should be migrated to
> > + * Node 2.  The migration path start on the nodes with the
> > + * processors (since allocations default to this node) and
> > + * fast memory, progress through medium and end with the
> > + * slow memory:
> > + *
> > + *     0 -> 1 -> 2 -> stop
> > + *     3 -> 4 -> 5 -> stop
> > + *
> > + * This is represented in the node_demotion[] like this:
> > + *
> > + *     {  1, // Node 0 migrates to 1
> > + *        2, // Node 1 migrates to 2
> > + *       -1, // Node 2 does not migrate
> > + *        4, // Node 3 migrates to 4
> > + *        5, // Node 4 migrates to 5
> > + *       -1} // Node 5 does not migrate
> > + */
> 
> In this example, if we want to support multiple nodes as the demotion
> target of a source node, we can group these nodes into three tiers
> (classes):
> 
> fast class:
> 0 -> {1, 4}  // 1 is the preferred
> 3 -> {4, 1}  // 4 is the preferred
> 
> medium class:
> 1 -> {2, 5}  // 2 is the preferred
> 4 -> {5, 2}  // 5 is the preferred
> 
> slow class:
> 2 -> stop
> 5 -> stop

Hi Wei Xu,

I have some questions about it

Fast class/memory is pictured as the nodes with CPUs, while slow class/memory
is PMEM, right?
Then, what does the medium class/memory stand for?

In Dave's example, the list is created in a way that stays local to the socket,
and we go from the fast node to the slow one.
In yours, lists are created by taking the fastest nodes from all sockets and
working our way down, which means having cross-socket nodes in the list.
How much of a penalty is that?

And while I get your point, I am not sure if that is what we intend here.
This patchset aims to place cold pages that are about to be reclaimed in slower
nodes to give them a second chance, while your design seems to be more about
having different kinds of memory classes and being able to place applications
in one of those tiers depending on their demands or on sysadmin demand.

Could you expand some more?

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-04-14  8:08     ` Oscar Salvador
@ 2021-04-14  8:11       ` Oscar Salvador
  2021-04-14  8:12       ` David Hildenbrand
  2021-04-15  4:07       ` Wei Xu
  2 siblings, 0 replies; 64+ messages in thread
From: Oscar Salvador @ 2021-04-14  8:11 UTC (permalink / raw)
  To: Wei Xu
  Cc: Dave Hansen, Linux MM, Linux Kernel Mailing List, Yang Shi,
	David Rientjes, Huang Ying, Dan Williams, David Hildenbrand

On Wed, Apr 14, 2021 at 10:08:54AM +0200, Oscar Salvador wrote:
> In Dave's example, the list is created in a way that stays local to the socket,
> and we go from the fast node to the slow one.

Or maybe it is just because find_next_best_node() does not know any better
and creates the list that way?

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-04-14  8:08     ` Oscar Salvador
  2021-04-14  8:11       ` Oscar Salvador
@ 2021-04-14  8:12       ` David Hildenbrand
  2021-04-14  8:14         ` Oscar Salvador
  2021-04-15  4:07       ` Wei Xu
  2 siblings, 1 reply; 64+ messages in thread
From: David Hildenbrand @ 2021-04-14  8:12 UTC (permalink / raw)
  To: Oscar Salvador, Wei Xu
  Cc: Dave Hansen, Linux MM, Linux Kernel Mailing List, Yang Shi,
	David Rientjes, Huang Ying, Dan Williams

On 14.04.21 10:08, Oscar Salvador wrote:
> On Fri, Apr 09, 2021 at 08:07:08PM -0700, Wei Xu wrote:
>> On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>>> + * When Node 0 fills up, its memory should be migrated to
>>> + * Node 1.  When Node 1 fills up, it should be migrated to
>>> + * Node 2.  The migration path start on the nodes with the
>>> + * processors (since allocations default to this node) and
>>> + * fast memory, progress through medium and end with the
>>> + * slow memory:
>>> + *
>>> + *     0 -> 1 -> 2 -> stop
>>> + *     3 -> 4 -> 5 -> stop
>>> + *
>>> + * This is represented in the node_demotion[] like this:
>>> + *
>>> + *     {  1, // Node 0 migrates to 1
>>> + *        2, // Node 1 migrates to 2
>>> + *       -1, // Node 2 does not migrate
>>> + *        4, // Node 3 migrates to 4
>>> + *        5, // Node 4 migrates to 5
>>> + *       -1} // Node 5 does not migrate
>>> + */
>>
>> In this example, if we want to support multiple nodes as the demotion
>> target of a source node, we can group these nodes into three tiers
>> (classes):
>>
>> fast class:
>> 0 -> {1, 4}  // 1 is the preferred
>> 3 -> {4, 1}  // 4 is the preferred
>>
>> medium class:
>> 1 -> {2, 5}  // 2 is the preferred
>> 4 -> {5, 2}  // 5 is the preferred
>>
>> slow class:
>> 2 -> stop
>> 5 -> stop
> 
> Hi Wei Xu,
> 
> I have some questions about it
> 
> Fast class/memory are pictured as those nodes with CPUs, while Slow class/memory
> are PMEM, right?
> Then, what does medium class/memory stand for?

My guest best is that fast class is something like HBM (High Bandwidth 
Memory), medium class is ordinary RAM, slow class is PMEM.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-04-14  8:12       ` David Hildenbrand
@ 2021-04-14  8:14         ` Oscar Salvador
  2021-04-14  8:20           ` David Hildenbrand
  0 siblings, 1 reply; 64+ messages in thread
From: Oscar Salvador @ 2021-04-14  8:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Wei Xu, Dave Hansen, Linux MM, Linux Kernel Mailing List,
	Yang Shi, David Rientjes, Huang Ying, Dan Williams

On Wed, Apr 14, 2021 at 10:12:29AM +0200, David Hildenbrand wrote:
> My guest best is that fast class is something like HBM (High Bandwidth
> Memory), medium class is ordinary RAM, slow class is PMEM.

I see, thanks for the hint David ;-)

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-04-14  8:14         ` Oscar Salvador
@ 2021-04-14  8:20           ` David Hildenbrand
  0 siblings, 0 replies; 64+ messages in thread
From: David Hildenbrand @ 2021-04-14  8:20 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Wei Xu, Dave Hansen, Linux MM, Linux Kernel Mailing List,
	Yang Shi, David Rientjes, Huang Ying, Dan Williams

On 14.04.21 10:14, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 10:12:29AM +0200, David Hildenbrand wrote:
>> My guest best is that fast class is something like HBM (High Bandwidth

haha, whatever happened in that sentence: s/guest best/best guess/

>> Memory), medium class is ordinary RAM, slow class is PMEM.
> 
> I see, thanks for the hint David ;-)
> 


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-04-14  8:08     ` Oscar Salvador
  2021-04-14  8:11       ` Oscar Salvador
  2021-04-14  8:12       ` David Hildenbrand
@ 2021-04-15  4:07       ` Wei Xu
  2021-04-15 15:35         ` Dave Hansen
  2 siblings, 1 reply; 64+ messages in thread
From: Wei Xu @ 2021-04-15  4:07 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Dave Hansen, Linux MM, Linux Kernel Mailing List, Yang Shi,
	David Rientjes, Huang Ying, Dan Williams, David Hildenbrand

On Wed, Apr 14, 2021 at 1:08 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> Hi Wei Xu,
>
> I have some questions about it
>
> Fast class/memory are pictured as those nodes with CPUs, while Slow class/memory
> are PMEM, right?
> Then, what does medium class/memory stand for?

That is Dave's example.  I think David's guess makes sense (HBM - fast, DRAM -
medium, PMEM - slow).  It may also be possible that we have DDR5 as fast,
CXL-DDR4 as medium, and CXL-PMEM as slow.  But the most likely use cases for
now should be just two tiers: DRAM vs PMEM or other types of slower
memory devices.

> In Dave's example, the list is created in a way that stays local to the socket,
> and we go from the fast node to the slow one.
> In yours, the lists are created by taking the fastest nodes from all sockets and
> working our way down, which means we have cross-socket nodes in the list.
> How much of a penalty is that?

Cross-socket demotion is certainly more expensive.  But because it is
sequential access and can also be optimized with non-temporal stores, it
may not be much slower than demotion to a local node in the next tier.
The actual penalty will depend on the devices.
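
To make the non-temporal-store point concrete, here is a rough userspace
sketch; the function name and loop are illustrative only and are not the
copy path used by the kernel or by this series.

#include <emmintrin.h>	/* SSE2 streaming-store intrinsics */
#include <stddef.h>

/*
 * Illustrative only: demotion copies whole pages sequentially, so a
 * non-temporal (streaming) copy can bypass the CPU caches and avoid
 * polluting them with data that will not be touched again soon.
 */
void copy_page_nontemporal(void *dst, const void *src, size_t bytes)
{
	const __m128i *s = src;
	__m128i *d = dst;
	size_t i;

	/* assumes 16-byte alignment and a multiple-of-16 size (true for pages) */
	for (i = 0; i < bytes / sizeof(__m128i); i++)
		_mm_stream_si128(&d[i], _mm_load_si128(&s[i]));

	_mm_sfence();	/* make the streaming stores globally visible */
}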

> And while I get your point, I am not sure if that is what we intend here.
> This patchset aims to place cold pages that are about to be reclaimed in slower
> nodes to give them a second chance, while your design seems more about having
> different kinds of memory classes and being able to place applications in one of
> those tiers depending on their demands or on sysadmin demand.
>
> Could you expand some more?

Sure.  What I have described has the same goal as Dave's patchset, i.e.,
to demote cold pages to the slower nodes when they are about to be
reclaimed.  The only difference is that in my suggestion the demotion
target of a fast tier node is expanded from a single node to a set of
nodes from the slow tier, and one node in such a set can be marked as
the preferred/local demotion target.  This can help enable more flexible
demotion policies to be configured, such as to allow a cgroup to
allocate from all fast tier nodes, but only demote to a local slow tier
node.  Such a policy can reduce memory stranding at the fast tier
(compared to if memory hardwall is used) and still allow demotion from
all fast tier nodes without incurring the expensive random accesses to
the demoted pages if they were demoted to remote slow tier nodes.

I understand that Dave started this patchset with a simplified demotion
path definition, which I agree with.  Meanwhile, I think this more
generalized definition of the demotion path is useful and can also be
important for some use cases.
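
As a rough userspace model of that generalization (the structure and
helper names below are made up for illustration and are not from the
series):

#include <stdio.h>

#define MAX_NODES	6
#define NO_NODE		-1

/* Each source node gets a set of allowed demotion targets plus one
 * preferred/local target, instead of a single node_demotion[] entry. */
struct demotion_targets {
	int preferred;			/* preferred (local) target node */
	unsigned int allowed;		/* bitmask of allowed target nodes */
};

/* The fast/medium/slow example from earlier in the thread. */
static const struct demotion_targets node_demotion_set[MAX_NODES] = {
	[0] = { .preferred = 1,       .allowed = (1u << 1) | (1u << 4) },
	[1] = { .preferred = 2,       .allowed = (1u << 2) | (1u << 5) },
	[2] = { .preferred = NO_NODE, .allowed = 0 },	/* terminal */
	[3] = { .preferred = 4,       .allowed = (1u << 4) | (1u << 1) },
	[4] = { .preferred = 5,       .allowed = (1u << 5) | (1u << 2) },
	[5] = { .preferred = NO_NODE, .allowed = 0 },	/* terminal */
};

/* Pick the preferred target if the policy (cgroup/cpuset) allows it,
 * otherwise fall back to any other allowed target. */
static int next_demotion_node_in(int node, unsigned int policy_mask)
{
	const struct demotion_targets *t = &node_demotion_set[node];
	unsigned int candidates = t->allowed & policy_mask;
	int nid;

	if (t->preferred != NO_NODE && (candidates & (1u << t->preferred)))
		return t->preferred;
	for (nid = 0; nid < MAX_NODES; nid++)
		if (candidates & (1u << nid))
			return nid;
	return NO_NODE;
}

int main(void)
{
	/* A policy that only allows demotion to the local slow nodes 4 and 5. */
	unsigned int local_only = (1u << 4) | (1u << 5);

	printf("node 0, unrestricted: demote to %d\n",
	       next_demotion_node_in(0, ~0u));
	printf("node 0, local-only policy: demote to %d\n",
	       next_demotion_node_in(0, local_only));
	return 0;
}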


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-04-15  4:07       ` Wei Xu
@ 2021-04-15 15:35         ` Dave Hansen
  2021-04-15 20:25           ` Wei Xu
  0 siblings, 1 reply; 64+ messages in thread
From: Dave Hansen @ 2021-04-15 15:35 UTC (permalink / raw)
  To: Wei Xu, Oscar Salvador
  Cc: Dave Hansen, Linux MM, Linux Kernel Mailing List, Yang Shi,
	David Rientjes, Huang Ying, Dan Williams, David Hildenbrand

On 4/14/21 9:07 PM, Wei Xu wrote:
> On Wed, Apr 14, 2021 at 1:08 AM Oscar Salvador <osalvador@suse.de> wrote:
>> Fast class/memory are pictured as those nodes with CPUs, while Slow class/memory
>> are PMEM, right?
>> Then, what does medium class/memory stand for?
> 
> That is Dave's example.  I think David's guess makes sense (HBM - fast, DRAM -
> medium, PMEM - slow).  It may also be possible that we have DDR5 as fast,
> CXL-DDR4 as medium, and CXL-PMEM as slow.  But the most likely use cases for
> now should be just two tiers: DRAM vs PMEM or other types of slower
> memory devices.

Yes, it would be nice to apply this to fancier tiering systems.  But
DRAM/PMEM combos are out in the wild today and it's where I expect this
to be used first.

> This can help enable more flexible demotion policies to be
> configured, such as to allow a cgroup to allocate from all fast tier
> nodes, but only demote to a local slow tier node.  Such a policy can
> reduce memory stranding at the fast tier (compared to if memory
> hardwall is used) and still allow demotion from all fast tier nodes
> without incurring the expensive random accesses to the demoted pages
> if they were demoted to remote slow tier nodes.

Could you explain this stranding effect in a bit more detail?  I'm not
quite following.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-04-15 15:35         ` Dave Hansen
@ 2021-04-15 20:25           ` Wei Xu
  0 siblings, 0 replies; 64+ messages in thread
From: Wei Xu @ 2021-04-15 20:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Oscar Salvador, Dave Hansen, Linux MM, Linux Kernel Mailing List,
	Yang Shi, David Rientjes, Huang Ying, Dan Williams,
	David Hildenbrand

On Thu, Apr 15, 2021 at 8:35 AM Dave Hansen <dave.hansen@intel.com> wrote:
> > This can help enable more flexible demotion policies to be
> > configured, such as to allow a cgroup to allocate from all fast tier
> > nodes, but only demote to a local slow tier node.  Such a policy can
> > reduce memory stranding at the fast tier (compared to if memory
> > hardwall is used) and still allow demotion from all fast tier nodes
> > without incurring the expensive random accesses to the demoted pages
> > if they were demoted to remote slow tier nodes.
>
> Could you explain this stranding effect in a bit more detail?  I'm not
> quite following.

By memory stranding, I mean that memory on a machine (or a NUMA node)
cannot be utilized even under extremely high workloads.  Memory
stranding usually happens due to mismatches between job/machine
shapes as well as resource fragmentation resulting from bin-packing
scheduling.  It is an important problem for cloud resource efficiency.

If NUMA hardwalling is used, we effectively split a single machine
into multiple smaller machines based on NUMA nodes.  This changes the
machine shapes and also makes memory more fragmented, which can lead
to more memory being stranded.

Here is a simple example:  Suppose that each machine has 2 NUMA nodes,
each with 4 cores and 5GB of RAM, and all the jobs have the shape of
2 CPUs and 3GB of memory.  Without NUMA memory hardwalling, we can pack
3 jobs onto each machine, which leaves 1GB of memory and 2 cores
stranded.  However, with NUMA memory hardwalling enabled, we can then
only pack 2 jobs onto each machine (one job on each NUMA node), which
increases the stranded resources to 4GB of memory and 4 cores.
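
A tiny C rendering of that arithmetic (the struct and helper are purely
illustrative); it prints the same 2-core/1GB vs 4-core/4GB stranding
numbers:

#include <stdio.h>

struct shape { int cpus, mem_gb; };

/* How many jobs of a given shape fit into a bin of a given shape. */
static int jobs_fit(struct shape bin, struct shape job)
{
	int by_cpu = bin.cpus / job.cpus;
	int by_mem = bin.mem_gb / job.mem_gb;

	return by_cpu < by_mem ? by_cpu : by_mem;
}

int main(void)
{
	struct shape node    = { .cpus = 4, .mem_gb = 5 };	/* one NUMA node */
	struct shape machine = { .cpus = 8, .mem_gb = 10 };	/* two nodes */
	struct shape job     = { .cpus = 2, .mem_gb = 3 };

	int no_hardwall = jobs_fit(machine, job);	/* pack against the machine */
	int hardwall    = 2 * jobs_fit(node, job);	/* pack per NUMA node */

	printf("no hardwall: %d jobs, %d cores / %dGB stranded\n",
	       no_hardwall, machine.cpus - no_hardwall * job.cpus,
	       machine.mem_gb - no_hardwall * job.mem_gb);
	printf("hardwall:    %d jobs, %d cores / %dGB stranded\n",
	       hardwall, machine.cpus - hardwall * job.cpus,
	       machine.mem_gb - hardwall * job.mem_gb);
	return 0;
}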


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard
  2021-04-01 18:32 [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Dave Hansen
                   ` (9 preceding siblings ...)
  2021-04-01 18:32 ` [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration Dave Hansen
@ 2021-04-16 12:35 ` Michal Hocko
  2021-04-16 14:26   ` Dave Hansen
  10 siblings, 1 reply; 64+ messages in thread
From: Michal Hocko @ 2021-04-16 12:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, yang.shi, rientjes, ying.huang,
	dan.j.williams, david, osalvador, weixugc

Hi,
I am really sorry to jump into this train sooo late. I have quickly
glanced through the series and I have some questions/concerns. Let me
express them here rather than in specific patches.

First of all I do think that demotion is a useful way to balance the
memory in general. And that is not really bound to PMEM equipped
systems. There are larger NUMA machines which are not trivial to
partition and our existing NUMA APIs are far from ideal to help with
that.

I do appreciate that the whole thing is an opt-in because this might
break workloads which are careful with the placement. I am not sure
there is a way to handle constraints in an optimal way if that is
possible at all in some cases (e.g. do we have a way to track a page to
its cpuset resp. task mempolicy in all cases?).

The cover letter is focusing on usecases but it doesn't really describe
the design, so let me try to lay it down here (let's see whether I missed
something important).
- The demotion order defines a very simple fallback to a single node
  based on proximity, but cycles are not allowed in the fallback
  mask.
  I have to confess that I haven't grasped the initialization
  completely. There is a nice comment explaining a 2 socket system with
  3 different NUMA nodes attached to it with one node being terminal.
  This is OK if the terminal node is PMEM but how that fits into usual
  NUMA setups. E.g.
  4 nodes each with its set of CPUs
  node distances:
  node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
  Do I get it right that Node 3 would be terminal?
- The demotion is controlled by node_reclaim_mode but unlike other modes
  it applies to both direct and kswapd reclaims.
  I do not see that explained anywhere though.
- The demotion is implemented at shrink_page_list level which migrates
  pages in the first round and then falls back to the regular reclaim
  when migration fails. This means that the reclaim context
  (PF_MEMALLOC) will allocate memory so it has access to full memory
  reserves. Btw. I do not see __GFP_NOMEMALLOC anywhere in the allocation
  mask, which looks like a bug rather than an intention. Btw. using
  GFP_NOWAIT in the allocation callback would make more things clear
  IMO.
- Memcg reclaim is excluded from all this because it is not NUMA aware
  which makes sense to me.
- Anonymous pages are a bit tricky because they can be demoted even when
  they cannot be reclaimed due to no (or no available) swap storage.
  Unless I have missed something the second round will try to reclaim
  them even if the latter is true and I am not sure this is completely OK.

I hope I've captured all important parts. There are some more details
but they do not seem that important. 

I am still trying to digest the whole thing but at least jamming
node_reclaim logic into kswapd seems strange to me. Need to think more
about that though.

Btw. do you have any numbers from running this with some real world
workload?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard
  2021-04-16 12:35 ` [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Michal Hocko
@ 2021-04-16 14:26   ` Dave Hansen
  2021-04-16 15:02     ` Michal Hocko
  0 siblings, 1 reply; 64+ messages in thread
From: Dave Hansen @ 2021-04-16 14:26 UTC (permalink / raw)
  To: Michal Hocko, Dave Hansen
  Cc: linux-mm, linux-kernel, yang.shi, rientjes, ying.huang,
	dan.j.williams, david, osalvador, weixugc

On 4/16/21 5:35 AM, Michal Hocko wrote:
>   I have to confess that I haven't grasped the initialization
>   completely. There is a nice comment explaining a 2 socket system with
>   3 different NUMA nodes attached to it with one node being terminal.
>   This is OK if the terminal node is PMEM but how that fits into usual
>   NUMA setups. E.g.
>   4 nodes each with its set of CPUs
>   node distances:
>   node   0   1   2   3
>   0:  10  20  20  20
>   1:  20  10  20  20
>   2:  20  20  10  20
>   3:  20  20  20  10
>   Do I get it right that Node 3 would be terminal?

Yes, I think Node 3 would end up being the terminal node in that setup.

That said, I'm not sure how much I expect folks to use this on
traditional, non-tiered setups.  It's also hard to argue what the
migration order *should* be when all the nodes are uniform.

> - The demotion is controlled by node_reclaim_mode but unlike other modes
>   it applies to both direct and kswapd reclaims.
>   I do not see that explained anywhere though.

That's an interesting observation.  Let me do a bit of research and I'll
update the Documentation/ and the changelog.

> - The demotion is implemented at shrink_page_list level which migrates
>   pages in the first round and then falls back to the regular reclaim
>   when migration fails. This means that the reclaim context
>   (PF_MEMALLOC) will allocate memory so it has access to full memory
>   reserves. Btw. I do not see __GFP_NOMEMALLOC anywhere in the allocation
>   mask, which looks like a bug rather than an intention. Btw. using
>   GFP_NOWAIT in the allocation callback would make more things clear
>   IMO.

Yes, the lack of __GFP_NOMEMALLOC is a bug.  I'll fix that up.

GFP_NOWAIT _seems_ like it will work.  I'll give it a shot.

> - Memcg reclaim is excluded from all this because it is not NUMA aware
>   which makes sense to me.
> - Anonymous pages are a bit tricky because they can be demoted even when
>   they cannot be reclaimed due to no (or no available) swap storage.
>   Unless I have missed something the second round will try to reclaim
>   them even if the latter is true and I am not sure this is completely OK.

What we want is something like this:

Swap Space / Demotion OK  -> Can Reclaim
Swap Space / Demotion Off -> Can Reclaim
Swap Full  / Demotion OK  -> Can Reclaim
Swap Full  / Demotion Off -> No Reclaim

I *think* that's what can_reclaim_anon_pages() ends up doing.  Maybe I'm
misunderstanding what you are referring to, though.  By "second round"
did you mean when we do reclaim on a node which is a terminal node?
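
A minimal userspace model of the table above (the function name only
mirrors can_reclaim_anon_pages() for readability; the body is just the
truth table, not the kernel implementation):

#include <stdbool.h>
#include <stdio.h>

/* Anon pages can be aged if there is somewhere for them to go:
 * either swap space is available or demotion is enabled. */
static bool can_reclaim_anon(bool swap_available, bool demotion_ok)
{
	return swap_available || demotion_ok;
}

int main(void)
{
	printf("swap space, demotion ok : %d\n", can_reclaim_anon(true,  true));
	printf("swap space, demotion off: %d\n", can_reclaim_anon(true,  false));
	printf("swap full,  demotion ok : %d\n", can_reclaim_anon(false, true));
	printf("swap full,  demotion off: %d\n", can_reclaim_anon(false, false));
	return 0;
}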

> I am still trying to digest the whole thing but at least jamming
> node_reclaim logic into kswapd seems strange to me. Need to think more
> about that though.

I'm entirely open to other ways to do the opt-in.  It seemed sane at the
time, but I also understand the kswapd concern.

> Btw. do you have any numbers from running this with some real world
> workload?

Yes, quite a bit.  Do you have a specific scenario in mind?  Folks seem
to come at this in two different ways:

Some want to know how much DRAM they can replace by buying some PMEM.
They tend to care about how much adding the (cheaper) PMEM slows them
down versus (expensive) DRAM.  They're making a cost-benefit call

Others want to repurpose some PMEM they already have.  They want to know
how much using PMEM in this way will speed them up.  They will basically
take any speedup they can get.

I ask because as a kernel developer with PMEM in my systems, I find the
"I'll take what I can get" case more personally appealing.  But, the
business folks are much more keen on the "DRAM replacement" use.  Do you
have any thoughts on what you would like to see?


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard
  2021-04-16 14:26   ` Dave Hansen
@ 2021-04-16 15:02     ` Michal Hocko
  2021-04-21  2:39       ` Huang, Ying
  2021-05-07  6:14       ` Huang, Ying
  0 siblings, 2 replies; 64+ messages in thread
From: Michal Hocko @ 2021-04-16 15:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, linux-mm, linux-kernel, yang.shi, rientjes,
	ying.huang, dan.j.williams, david, osalvador, weixugc

On Fri 16-04-21 07:26:43, Dave Hansen wrote:
> On 4/16/21 5:35 AM, Michal Hocko wrote:
> >   I have to confess that I haven't grasped the initialization
> >   completely. There is a nice comment explaining a 2 socket system with
> >   3 different NUMA nodes attached to it with one node being terminal.
> >   This is OK if the terminal node is PMEM but how that fits into usual
> >   NUMA setups. E.g.
> >   4 nodes each with its set of CPUs
> >   node distances:
> >   node   0   1   2   3
> >   0:  10  20  20  20
> >   1:  20  10  20  20
> >   2:  20  20  10  20
> >   3:  20  20  20  10
> >   Do I get it right that Node 3 would be terminal?
> 
> Yes, I think Node 3 would end up being the terminal node in that setup.
> 
> That said, I'm not sure how much I expect folks to use this on
> traditional, non-tiered setups.  It's also hard to argue what the
> migration order *should* be when all the nodes are uniform.

Well, they are not really uniform. The access latencies differ and I can
imagine that spreading page cache to a distant node might be just much
better than the IO involved in the refault.

On the other hand I do understand that restricting the feature to CPU-less
NUMA setups is quite sane to give us a better understanding of how
this can be used and improve on top. Maybe we will learn that we want to
have the demotion path admin-controllable (on the system level or maybe
even more fine-grained on the memcg/cpuset level).

[...]
> > - The demotion is implemented at shrink_page_list level which migrates
> >   pages in the first round and then falls back to the regular reclaim
> >   when migration fails. This means that the reclaim context
> >   (PF_MEMALLOC) will allocate memory so it has access to full memory
> >   reserves. Btw. I do not see __GFP_NOMEMALLOC anywhere in the allocation
> >   mask, which looks like a bug rather than an intention. Btw. using
> >   GFP_NOWAIT in the allocation callback would make more things clear
> >   IMO.
> 
> Yes, the lack of __GFP_NOMEMALLOC is a bug.  I'll fix that up.
> 
> GFP_NOWAIT _seems_ like it will work.  I'll give it a shot.

Let me clarify a bit. The slow path does involve __gfp_pfmemalloc_flags
before bailing out for non-sleeping allocations. So you would need both.
Unless you want to involve reclaim on the target node while you are
reclaiming the origin node.
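
For concreteness, a sketch of what a demotion allocation mask combining
the two could look like; this is an illustration, not necessarily the
mask the series ends up using:

#include <linux/gfp.h>

/*
 * Sketch only: never dip into the memory reserves (__GFP_NOMEMALLOC),
 * never enter direct reclaim on the target node (GFP_NOWAIT), stay on
 * the intended demotion target (__GFP_THISNODE) and keep failures
 * quiet (__GFP_NOWARN).
 */
static const gfp_t demote_gfp_mask =
	GFP_NOWAIT | __GFP_THISNODE | __GFP_NOWARN | __GFP_NOMEMALLOC;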

> > - Memcg reclaim is excluded from all this because it is not NUMA aware
> >   which makes sense to me.
> > - Anonymous pages are a bit tricky because they can be demoted even when
> >   they cannot be reclaimed due to no (or no available) swap storage.
> >   Unless I have missed something the second round will try to reclaim
> >   them even if the latter is true and I am not sure this is completely OK.
> 
> What we want is something like this:
> 
> Swap Space / Demotion OK  -> Can Reclaim
> Swap Space / Demotion Off -> Can Reclaim
> Swap Full  / Demotion OK  -> Can Reclaim
> Swap Full  / Demotion Off -> No Reclaim
> 
> I *think* that's what can_reclaim_anon_pages() ends up doing.  Maybe I'm
> misunderstanding what you are referring to, though.  By "second round"
> did you mean when we do reclaim on a node which is a terminal node?

No, I mean the migration failure case where you splice back to the page
list to reclaim. In that round you do not demote and want to reclaim.
But a lack of swap space will make that page unreclaimable. I suspect
this would just work out fine but I am not sure off the top of my head.

> > I am still trying to digest the whole thing but at least jamming
> > node_reclaim logic into kswapd seems strange to me. Need to think more
> > about that though.
> 
> I'm entirely open to other ways to do the opt-in.  It seemed sane at the
> time, but I also understand the kswapd concern.
> 
> > Btw. do you have any numbers from running this with some real world
> > workload?
> 
> Yes, quite a bit.  Do you have a specific scenario in mind?  Folks seem
> to come at this in two different ways:
> 
> Some want to know how much DRAM they can replace by buying some PMEM.
> They tend to care about how much adding the (cheaper) PMEM slows them
> down versus (expensive) DRAM.  They're making a cost-benefit call
> 
> Others want to repurpose some PMEM they already have.  They want to know
> how much using PMEM in this way will speed them up.  They will basically
> take any speedup they can get.
> 
> I ask because as a kernel developer with PMEM in my systems, I find the
> "I'll take what I can get" case more personally appealing.  But, the
> business folks are much more keen on the "DRAM replacement" use.  Do you
> have any thoughts on what you would like to see?

I was thinking about typical large in-memory processing (e.g. in-memory
databases) where the hot part of the working set is only a portion and
spilling over to a slower memory can still be beneficial because IO +
data preprocessing on cold data is much slower.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard
  2021-04-16 15:02     ` Michal Hocko
@ 2021-04-21  2:39       ` Huang, Ying
  2021-05-07  6:14       ` Huang, Ying
  1 sibling, 0 replies; 64+ messages in thread
From: Huang, Ying @ 2021-04-21  2:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, Dave Hansen, linux-mm, linux-kernel, yang.shi,
	rientjes, dan.j.williams, david, osalvador, weixugc

Michal Hocko <mhocko@suse.com> writes:

> On Fri 16-04-21 07:26:43, Dave Hansen wrote:
>> On 4/16/21 5:35 AM, Michal Hocko wrote:
>> > - Anonymous pages are a bit tricky because they can be demoted even when
>> >   they cannot be reclaimed due to no (or no available) swap storage.
>> >   Unless I have missed something the second round will try to reclaim
>> >   them even if the latter is true and I am not sure this is completely OK.
>> 
>> What we want is something like this:
>> 
>> Swap Space / Demotion OK  -> Can Reclaim
>> Swap Space / Demotion Off -> Can Reclaim
>> Swap Full  / Demotion OK  -> Can Reclaim
>> Swap Full  / Demotion Off -> No Reclaim
>> 
>> I *think* that's what can_reclaim_anon_pages() ends up doing.  Maybe I'm
>> misunderstanding what you are referring to, though.  By "second round"
>> did you mean when we do reclaim on a node which is a terminal node?
>
> No, I mean the migration failure case where you splice back to the page
> list to reclaim. In that round you do not demote and want to reclaim.
> But a lack of swap space will make that page unreclaimable. I suspect
> this would just work out fine but I am not sure off the top of my head.

I have tested this by injecting migration errors (returning 0 from
demote_page_list() before migration) on a system without swap.  The
system still works properly.  In ftrace, I can see that add_to_swap() is
called many more times, and it deals with the situation where swap
space isn't available.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard
  2021-04-16 15:02     ` Michal Hocko
  2021-04-21  2:39       ` Huang, Ying
@ 2021-05-07  6:14       ` Huang, Ying
  1 sibling, 0 replies; 64+ messages in thread
From: Huang, Ying @ 2021-05-07  6:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, Dave Hansen, linux-mm, linux-kernel, yang.shi,
	rientjes, dan.j.williams, david, osalvador, weixugc

Hi, Michal,

Michal Hocko <mhocko@suse.com> writes:

[...]

>> 
>> > Btw. do you have any numbers from running this with some real world
>> > workload?
>> 
>> Yes, quite a bit.  Do you have a specific scenario in mind?  Folks seem
>> to come at this in two different ways:
>> 
>> Some want to know how much DRAM they can replace by buying some PMEM.
>> They tend to care about how much adding the (cheaper) PMEM slows them
>> down versus (expensive) DRAM.  They're making a cost-benefit call
>> 
>> Others want to repurpose some PMEM they already have.  They want to know
>> how much using PMEM in this way will speed them up.  They will basically
>> take any speedup they can get.
>> 
>> I ask because as a kernel developer with PMEM in my systems, I find the
>> "I'll take what I can get" case more personally appealing.  But, the
>> business folks are much more keen on the "DRAM replacement" use.  Do you
>> have any thoughts on what you would like to see?
>
> I was thinking about typical large in-memory processing (e.g. in-memory
> databases) where the hot part of the working set is only a portion and
> spilling over to a slower memory can still be beneficial because IO +
> data preprocessing on cold data is much slower.

We have tested the patchset with postgresql and pgbench.  On a
machine with DRAM and PMEM, the kernel with the patchset improves the
pgbench score by up to 22.1% compared with the DRAM-only + disk
case.  This comes from the reduced disk reads (read throughput is
reduced by up to 70.8%).

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-03-04 23:59 ` [PATCH 02/10] mm/numa: automatically generate node migration order Dave Hansen
@ 2021-03-08 23:59   ` Yang Shi
  0 siblings, 0 replies; 64+ messages in thread
From: Yang Shi @ 2021-03-08 23:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, Linux MM, Yang Shi, David Rientjes,
	Huang Ying, Dan Williams, David Hildenbrand, Oscar Salvador

On Thu, Mar 4, 2021 at 4:00 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> When memory fills up on a node, memory contents can be
> automatically migrated to another node.  The biggest problems are
> knowing when to migrate and to where the migration should be
> targeted.
>
> The most straightforward way to generate the "to where" list
> would be to follow the page allocator fallback lists.  Those
> lists already tell us if memory is full where to look next.  It
> would also be logical to move memory in that order.
>
> But, the allocator fallback lists have a fatal flaw: most nodes
> appear in all the lists.  This would potentially lead to
> migration cycles (A->B, B->A, A->B, ...).
>
> Instead of using the allocator fallback lists directly, keep a
> separate node migration ordering.  But, reuse the same data used
> to generate page allocator fallback in the first place:
> find_next_best_node().
>
> This means that the firmware data used to populate node distances
> essentially dictates the ordering for now.  It should also be
> architecture-neutral since all NUMA architectures have a working
> find_next_best_node().
>
> The protocol for node_demotion[] access and writing is not
> standard.  It has no specific locking and is intended to be read
> locklessly.  Readers must take care to avoid observing changes
> that appear incoherent.  This was done so that node_demotion[]
> locking has no chance of becoming a bottleneck on large systems
> with lots of CPUs in direct reclaim.
>
> This code is unused for now.  It will be called later in the
> series.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>
>
> --
>
> changes from 20200122:
>  * Add big node_demotion[] comment

Thanks for adding the comment. Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>
>  b/mm/internal.h   |    5 +
>  b/mm/migrate.c    |  174 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  b/mm/page_alloc.c |    4 -
>  3 files changed, 180 insertions(+), 3 deletions(-)
>
> diff -puN mm/internal.h~auto-setup-default-migration-path-from-firmware mm/internal.h
> --- a/mm/internal.h~auto-setup-default-migration-path-from-firmware     2021-03-04 15:35:52.407806439 -0800
> +++ b/mm/internal.h     2021-03-04 15:35:52.426806439 -0800
> @@ -520,12 +520,17 @@ static inline void mminit_validate_memmo
>
>  #ifdef CONFIG_NUMA
>  extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
> +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
>  #else
>  static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
>                                 unsigned int order)
>  {
>         return NODE_RECLAIM_NOSCAN;
>  }
> +static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
> +{
> +       return NUMA_NO_NODE;
> +}
>  #endif
>
>  extern int hwpoison_filter(struct page *p);
> diff -puN mm/migrate.c~auto-setup-default-migration-path-from-firmware mm/migrate.c
> --- a/mm/migrate.c~auto-setup-default-migration-path-from-firmware      2021-03-04 15:35:52.409806439 -0800
> +++ b/mm/migrate.c      2021-03-04 15:35:52.427806439 -0800
> @@ -1157,6 +1157,44 @@ out:
>         return rc;
>  }
>
> +
> +/*
> + * node_demotion[] example:
> + *
> + * Consider a system with two sockets.  Each socket has
> + * three classes of memory attached: fast, medium and slow.
> + * Each memory class is placed in its own NUMA node.  The
> + * CPUs are placed in the node with the "fast" memory.  The
> + * 6 NUMA nodes (0-5) might be split among the sockets like
> + * this:
> + *
> + *     Socket A: 0, 1, 2
> + *     Socket B: 3, 4, 5
> + *
> + * When Node 0 fills up, its memory should be migrated to
> + * Node 1.  When Node 1 fills up, it should be migrated to
> + * Node 2.  The migration path start on the nodes with the
> + * processors (since allocations default to this node) and
> + * fast memory, progress through medium and end with the
> + * slow memory:
> + *
> + *     0 -> 1 -> 2 -> stop
> + *     3 -> 4 -> 5 -> stop
> + *
> + * This is represented in the node_demotion[] like this:
> + *
> + *     {  1, // Node 0 migrates to 1
> + *        2, // Node 1 migrates to 2
> + *       -1, // Node 2 does not migrate
> + *        4, // Node 3 migrates to 1
> + *        5, // Node 4 migrates to 2
> + *       -1} // Node 5 does not migrate
> + */
> +
> +/*
> + * Writes to this array occur without locking.  READ_ONCE()
> + * is recommended for readers to ensure consistent reads.
> + */
>  static int node_demotion[MAX_NUMNODES] __read_mostly =
>         {[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
>
> @@ -1171,7 +1209,13 @@ static int node_demotion[MAX_NUMNODES] _
>   */
>  int next_demotion_node(int node)
>  {
> -       return node_demotion[node];
> +       /*
> +        * node_demotion[] is updated without excluding
> +        * this function from running.  READ_ONCE() avoids
> +        * reading multiple, inconsistent 'node' values
> +        * during an update.
> +        */
> +       return READ_ONCE(node_demotion[node]);
>  }
>
>  /*
> @@ -3175,3 +3219,131 @@ void migrate_vma_finalize(struct migrate
>  }
>  EXPORT_SYMBOL(migrate_vma_finalize);
>  #endif /* CONFIG_DEVICE_PRIVATE */
> +
> +/* Disable reclaim-based migration. */
> +static void disable_all_migrate_targets(void)
> +{
> +       int node;
> +
> +       for_each_online_node(node)
> +               node_demotion[node] = NUMA_NO_NODE;
> +}
> +
> +/*
> + * Find an automatic demotion target for 'node'.
> + * Failing here is OK.  It might just indicate
> + * being at the end of a chain.
> + */
> +static int establish_migrate_target(int node, nodemask_t *used)
> +{
> +       int migration_target;
> +
> +       /*
> +        * Can not set a migration target on a
> +        * node with it already set.
> +        *
> +        * No need for READ_ONCE() here since this
> +        * in the write path for node_demotion[].
> +        * This should be the only thread writing.
> +        */
> +       if (node_demotion[node] != NUMA_NO_NODE)
> +               return NUMA_NO_NODE;
> +
> +       migration_target = find_next_best_node(node, used);
> +       if (migration_target == NUMA_NO_NODE)
> +               return NUMA_NO_NODE;
> +
> +       node_demotion[node] = migration_target;
> +
> +       return migration_target;
> +}
> +
> +/*
> + * When memory fills up on a node, memory contents can be
> + * automatically migrated to another node instead of
> + * discarded at reclaim.
> + *
> + * Establish a "migration path" which will start at nodes
> + * with CPUs and will follow the priorities used to build the
> + * page allocator zonelists.
> + *
> + * The difference here is that cycles must be avoided.  If
> + * node0 migrates to node1, then neither node1, nor anything
> + * node1 migrates to can migrate to node0.
> + *
> + * This function can run simultaneously with readers of
> + * node_demotion[].  However, it can not run simultaneously
> + * with itself.  Exclusion is provided by memory hotplug events
> + * being single-threaded.
> + */
> +static void __set_migration_target_nodes(void)
> +{
> +       nodemask_t next_pass    = NODE_MASK_NONE;
> +       nodemask_t this_pass    = NODE_MASK_NONE;
> +       nodemask_t used_targets = NODE_MASK_NONE;
> +       int node;
> +
> +       /*
> +        * Avoid any oddities like cycles that could occur
> +        * from changes in the topology.  This will leave
> +        * a momentary gap when migration is disabled.
> +        */
> +       disable_all_migrate_targets();
> +
> +       /*
> +        * Ensure that the "disable" is visible across the system.
> +        * Readers will see either a combination of before+disable
> +        * state or disable+after.  They will never see before and
> +        * after state together.
> +        *
> +        * The before+after state together might have cycles and
> +        * could cause readers to do things like loop until this
> +        * function finishes.  This ensures they can only see a
> +        * single "bad" read and would, for instance, only loop
> +        * once.
> +        */
> +       smp_wmb();
> +
> +       /*
> +        * Allocations go close to CPUs, first.  Assume that
> +        * the migration path starts at the nodes with CPUs.
> +        */
> +       next_pass = node_states[N_CPU];
> +again:
> +       this_pass = next_pass;
> +       next_pass = NODE_MASK_NONE;
> +       /*
> +        * To avoid cycles in the migration "graph", ensure
> +        * that migration sources are not future targets by
> +        * setting them in 'used_targets'.  Do this only
> +        * once per pass so that multiple source nodes can
> +        * share a target node.
> +        *
> +        * 'used_targets' will become unavailable in future
> +        * passes.  This limits some opportunities for
> +        * multiple source nodes to share a destination.
> +        */
> +       nodes_or(used_targets, used_targets, this_pass);
> +       for_each_node_mask(node, this_pass) {
> +               int target_node = establish_migrate_target(node, &used_targets);
> +
> +               if (target_node == NUMA_NO_NODE)
> +                       continue;
> +
> +               /* Visit targets from this pass in the next pass: */
> +               node_set(target_node, next_pass);
> +       }
> +       /* Is another pass necessary? */
> +       if (!nodes_empty(next_pass))
> +               goto again;
> +}
> +
> +/*
> + * For callers that do not hold get_online_mems() already.
> + */
> +static void set_migration_target_nodes(void)
> +{
> +       get_online_mems();
> +       __set_migration_target_nodes();
> +       put_online_mems();
> +}
> diff -puN mm/page_alloc.c~auto-setup-default-migration-path-from-firmware mm/page_alloc.c
> --- a/mm/page_alloc.c~auto-setup-default-migration-path-from-firmware   2021-03-04 15:35:52.422806439 -0800
> +++ b/mm/page_alloc.c   2021-03-04 15:35:52.429806439 -0800
> @@ -3916,7 +3916,7 @@ retry:
>                         if (alloc_flags & ALLOC_NO_WATERMARKS)
>                                 goto try_this_zone;
>
> -                       if (!node_reclaim_enabled() ||
> +                       if (node_reclaim_mode == 0 ||
>                             !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
>                                 continue;
>
> @@ -5773,7 +5773,7 @@ static int node_load[MAX_NUMNODES];
>   *
>   * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
>   */
> -static int find_next_best_node(int node, nodemask_t *used_node_mask)
> +int find_next_best_node(int node, nodemask_t *used_node_mask)
>  {
>         int n, val;
>         int min_val = INT_MAX;
> _
>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 02/10] mm/numa: automatically generate node migration order
  2021-03-04 23:59 [PATCH 00/10] [v6] " Dave Hansen
@ 2021-03-04 23:59 ` Dave Hansen
  2021-03-08 23:59   ` Yang Shi
  0 siblings, 1 reply; 64+ messages in thread
From: Dave Hansen @ 2021-03-04 23:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, yang.shi, rientjes, ying.huang,
	dan.j.williams, david, osalvador


From: Dave Hansen <dave.hansen@linux.intel.com>

When memory fills up on a node, memory contents can be
automatically migrated to another node.  The biggest problems are
knowing when to migrate and to where the migration should be
targeted.

The most straightforward way to generate the "to where" list
would be to follow the page allocator fallback lists.  Those
lists already tell us if memory is full where to look next.  It
would also be logical to move memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes
appear in all the lists.  This would potentially lead to
migration cycles (A->B, B->A, A->B, ...).

Instead of using the allocator fallback lists directly, keep a
separate node migration ordering.  But, reuse the same data used
to generate page allocator fallback in the first place:
find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now.  It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().

The protocol for node_demotion[] access and writing is not
standard.  It has no specific locking and is intended to be read
locklessly.  Readers must take care to avoid observing changes
that appear incoherent.  This was done so that node_demotion[]
locking has no chance of becoming a bottleneck on large systems
with lots of CPUs in direct reclaim.

This code is unused for now.  It will be called later in the
series.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>

--

changes from 20200122:
 * Add big node_demotion[] comment
---

 b/mm/internal.h   |    5 +
 b/mm/migrate.c    |  174 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 b/mm/page_alloc.c |    4 -
 3 files changed, 180 insertions(+), 3 deletions(-)

diff -puN mm/internal.h~auto-setup-default-migration-path-from-firmware mm/internal.h
--- a/mm/internal.h~auto-setup-default-migration-path-from-firmware	2021-03-04 15:35:52.407806439 -0800
+++ b/mm/internal.h	2021-03-04 15:35:52.426806439 -0800
@@ -520,12 +520,17 @@ static inline void mminit_validate_memmo
 
 #ifdef CONFIG_NUMA
 extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
+extern int find_next_best_node(int node, nodemask_t *used_node_mask);
 #else
 static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
 				unsigned int order)
 {
 	return NODE_RECLAIM_NOSCAN;
 }
+static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
+{
+	return NUMA_NO_NODE;
+}
 #endif
 
 extern int hwpoison_filter(struct page *p);
diff -puN mm/migrate.c~auto-setup-default-migration-path-from-firmware mm/migrate.c
--- a/mm/migrate.c~auto-setup-default-migration-path-from-firmware	2021-03-04 15:35:52.409806439 -0800
+++ b/mm/migrate.c	2021-03-04 15:35:52.427806439 -0800
@@ -1157,6 +1157,44 @@ out:
 	return rc;
 }
 
+
+/*
+ * node_demotion[] example:
+ *
+ * Consider a system with two sockets.  Each socket has
+ * three classes of memory attached: fast, medium and slow.
+ * Each memory class is placed in its own NUMA node.  The
+ * CPUs are placed in the node with the "fast" memory.  The
+ * 6 NUMA nodes (0-5) might be split among the sockets like
+ * this:
+ *
+ *	Socket A: 0, 1, 2
+ *	Socket B: 3, 4, 5
+ *
+ * When Node 0 fills up, its memory should be migrated to
+ * Node 1.  When Node 1 fills up, it should be migrated to
+ * Node 2.  The migration path start on the nodes with the
+ * processors (since allocations default to this node) and
+ * fast memory, progress through medium and end with the
+ * slow memory:
+ *
+ *	0 -> 1 -> 2 -> stop
+ *	3 -> 4 -> 5 -> stop
+ *
+ * This is represented in the node_demotion[] like this:
+ *
+ *	{  1, // Node 0 migrates to 1
+ *	   2, // Node 1 migrates to 2
+ *	  -1, // Node 2 does not migrate
+ *	   4, // Node 3 migrates to 4
+ *	   5, // Node 4 migrates to 5
+ *	  -1} // Node 5 does not migrate
+ */
+
+/*
+ * Writes to this array occur without locking.  READ_ONCE()
+ * is recommended for readers to ensure consistent reads.
+ */
 static int node_demotion[MAX_NUMNODES] __read_mostly =
 	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
 
@@ -1171,7 +1209,13 @@ static int node_demotion[MAX_NUMNODES] _
  */
 int next_demotion_node(int node)
 {
-	return node_demotion[node];
+	/*
+	 * node_demotion[] is updated without excluding
+	 * this function from running.  READ_ONCE() avoids
+	 * reading multiple, inconsistent 'node' values
+	 * during an update.
+	 */
+	return READ_ONCE(node_demotion[node]);
 }
 
 /*
@@ -3175,3 +3219,131 @@ void migrate_vma_finalize(struct migrate
 }
 EXPORT_SYMBOL(migrate_vma_finalize);
 #endif /* CONFIG_DEVICE_PRIVATE */
+
+/* Disable reclaim-based migration. */
+static void disable_all_migrate_targets(void)
+{
+	int node;
+
+	for_each_online_node(node)
+		node_demotion[node] = NUMA_NO_NODE;
+}
+
+/*
+ * Find an automatic demotion target for 'node'.
+ * Failing here is OK.  It might just indicate
+ * being at the end of a chain.
+ */
+static int establish_migrate_target(int node, nodemask_t *used)
+{
+	int migration_target;
+
+	/*
+	 * Can not set a migration target on a
+	 * node with it already set.
+	 *
+	 * No need for READ_ONCE() here since this
+	 * in the write path for node_demotion[].
+	 * This should be the only thread writing.
+	 */
+	if (node_demotion[node] != NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	migration_target = find_next_best_node(node, used);
+	if (migration_target == NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	node_demotion[node] = migration_target;
+
+	return migration_target;
+}
+
+/*
+ * When memory fills up on a node, memory contents can be
+ * automatically migrated to another node instead of
+ * discarded at reclaim.
+ *
+ * Establish a "migration path" which will start at nodes
+ * with CPUs and will follow the priorities used to build the
+ * page allocator zonelists.
+ *
+ * The difference here is that cycles must be avoided.  If
+ * node0 migrates to node1, then neither node1, nor anything
+ * node1 migrates to can migrate to node0.
+ *
+ * This function can run simultaneously with readers of
+ * node_demotion[].  However, it can not run simultaneously
+ * with itself.  Exclusion is provided by memory hotplug events
+ * being single-threaded.
+ */
+static void __set_migration_target_nodes(void)
+{
+	nodemask_t next_pass	= NODE_MASK_NONE;
+	nodemask_t this_pass	= NODE_MASK_NONE;
+	nodemask_t used_targets = NODE_MASK_NONE;
+	int node;
+
+	/*
+	 * Avoid any oddities like cycles that could occur
+	 * from changes in the topology.  This will leave
+	 * a momentary gap when migration is disabled.
+	 */
+	disable_all_migrate_targets();
+
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 *
+	 * The before+after state together might have cycles and
+	 * could cause readers to do things like loop until this
+	 * function finishes.  This ensures they can only see a
+	 * single "bad" read and would, for instance, only loop
+	 * once.
+	 */
+	smp_wmb();
+
+	/*
+	 * Allocations go close to CPUs, first.  Assume that
+	 * the migration path starts at the nodes with CPUs.
+	 */
+	next_pass = node_states[N_CPU];
+again:
+	this_pass = next_pass;
+	next_pass = NODE_MASK_NONE;
+	/*
+	 * To avoid cycles in the migration "graph", ensure
+	 * that migration sources are not future targets by
+	 * setting them in 'used_targets'.  Do this only
+	 * once per pass so that multiple source nodes can
+	 * share a target node.
+	 *
+	 * 'used_targets' will become unavailable in future
+	 * passes.  This limits some opportunities for
+	 * multiple source nodes to share a destination.
+	 */
+	nodes_or(used_targets, used_targets, this_pass);
+	for_each_node_mask(node, this_pass) {
+		int target_node = establish_migrate_target(node, &used_targets);
+
+		if (target_node == NUMA_NO_NODE)
+			continue;
+
+		/* Visit targets from this pass in the next pass: */
+		node_set(target_node, next_pass);
+	}
+	/* Is another pass necessary? */
+	if (!nodes_empty(next_pass))
+		goto again;
+}
+
+/*
+ * For callers that do not hold get_online_mems() already.
+ */
+static void set_migration_target_nodes(void)
+{
+	get_online_mems();
+	__set_migration_target_nodes();
+	put_online_mems();
+}
diff -puN mm/page_alloc.c~auto-setup-default-migration-path-from-firmware mm/page_alloc.c
--- a/mm/page_alloc.c~auto-setup-default-migration-path-from-firmware	2021-03-04 15:35:52.422806439 -0800
+++ b/mm/page_alloc.c	2021-03-04 15:35:52.429806439 -0800
@@ -3916,7 +3916,7 @@ retry:
 			if (alloc_flags & ALLOC_NO_WATERMARKS)
 				goto try_this_zone;
 
-			if (!node_reclaim_enabled() ||
+			if (node_reclaim_mode == 0 ||
 			    !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
 				continue;
 
@@ -5773,7 +5773,7 @@ static int node_load[MAX_NUMNODES];
  *
  * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
  */
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+int find_next_best_node(int node, nodemask_t *used_node_mask)
 {
 	int n, val;
 	int min_val = INT_MAX;
_


^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2021-05-07  6:14 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-01 18:32 [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Dave Hansen
2021-04-01 18:32 ` [PATCH 01/10] mm/numa: node demotion data structure and lookup Dave Hansen
2021-04-08  8:03   ` Oscar Salvador
2021-04-08 21:29     ` Dave Hansen
2021-04-09  5:32   ` Wei Xu
2021-04-01 18:32 ` [PATCH 02/10] mm/numa: automatically generate node migration order Dave Hansen
2021-04-08  8:26   ` Oscar Salvador
2021-04-08 21:51     ` Dave Hansen
2021-04-09  8:17       ` Oscar Salvador
2021-04-10  3:07   ` Wei Xu
2021-04-14  8:08     ` Oscar Salvador
2021-04-14  8:11       ` Oscar Salvador
2021-04-14  8:12       ` David Hildenbrand
2021-04-14  8:14         ` Oscar Salvador
2021-04-14  8:20           ` David Hildenbrand
2021-04-15  4:07       ` Wei Xu
2021-04-15 15:35         ` Dave Hansen
2021-04-15 20:25           ` Wei Xu
2021-04-01 18:32 ` [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events Dave Hansen
2021-04-08  9:52   ` Oscar Salvador
2021-04-09 10:14     ` Oscar Salvador
2021-04-09 10:15       ` Oscar Salvador
2021-04-09 18:59       ` David Hildenbrand
2021-04-12  7:19         ` Oscar Salvador
2021-04-12  9:19           ` David Hildenbrand
2021-04-01 18:32 ` [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded Dave Hansen
2021-04-01 22:35   ` Wei Xu
2021-04-01 23:21     ` Dave Hansen
2021-04-01 22:39   ` Wei Xu
2021-04-08 10:14   ` Oscar Salvador
2021-04-08 17:26     ` Yang Shi
2021-04-08 18:17       ` Oscar Salvador
2021-04-08 18:21         ` Oscar Salvador
2021-04-08 20:40         ` Yang Shi
2021-04-09  5:06           ` Oscar Salvador
2021-04-09  5:43             ` Wei Xu
2021-04-09 15:43             ` Yang Shi
2021-04-09 15:50         ` Dave Hansen
2021-04-09 18:47           ` Wei Xu
2021-04-09 20:10           ` Yang Shi
2021-04-01 18:32 ` [PATCH 05/10] mm/migrate: demote pages during reclaim Dave Hansen
2021-04-01 20:01   ` Yang Shi
2021-04-01 22:58     ` Dave Hansen
2021-04-08 10:47   ` Oscar Salvador
2021-04-10  3:35   ` Wei Xu
2021-04-01 18:32 ` [PATCH 06/10] mm/vmscan: add page demotion counter Dave Hansen
2021-04-10  3:40   ` Wei Xu
2021-04-01 18:32 ` [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages Dave Hansen
2021-04-07 18:40   ` Wei Xu
2021-04-09  8:31     ` Oscar Salvador
2021-04-01 18:32 ` [PATCH 08/10] mm/vmscan: Consider anonymous pages without swap Dave Hansen
2021-04-02  0:55   ` Wei Xu
2021-04-01 18:32 ` [PATCH 09/10] mm/vmscan: never demote for memcg reclaim Dave Hansen
2021-04-02  0:18   ` Wei Xu
2021-04-01 18:32 ` [PATCH 10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration Dave Hansen
2021-04-01 20:06   ` Yang Shi
2021-04-10  4:10   ` Wei Xu
2021-04-16 12:35 ` [PATCH 00/10] [v7][RESEND] Migrate Pages in lieu of discard Michal Hocko
2021-04-16 14:26   ` Dave Hansen
2021-04-16 15:02     ` Michal Hocko
2021-04-21  2:39       ` Huang, Ying
2021-05-07  6:14       ` Huang, Ying
  -- strict thread matches above, loose matches on Subject: below --
2021-03-04 23:59 [PATCH 00/10] [v6] " Dave Hansen
2021-03-04 23:59 ` [PATCH 02/10] mm/numa: automatically generate node migration order Dave Hansen
2021-03-08 23:59   ` Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).