[02/10] mm/numa: automatically generate node migration order
diff mbox series

Message ID 20210401183219.DC1928FA@viggo.jf.intel.com
State New, archived
Headers show
Series
  • Migrate Pages in lieu of discard
Related show

Commit Message

Dave Hansen April 1, 2021, 6:32 p.m. UTC
From: Dave Hansen <dave.hansen@linux.intel.com>

When memory fills up on a node, memory contents can be
automatically migrated to another node.  The biggest problems are
knowing when to migrate and to where the migration should be
targeted.

The most straightforward way to generate the "to where" list would
be to follow the page allocator fallback lists.  Those lists
already tell us if memory is full where to look next.  It would
also be logical to move memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes
appear in all the lists.  This would potentially lead to migration
cycles (A->B, B->A, A->B, ...).

Instead of using the allocator fallback lists directly, keep a
separate node migration ordering.  But, reuse the same data used
to generate page allocator fallback in the first place:
find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now.  It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().

The protocol for node_demotion[] access and writing is not
standard.  It has no specific locking and is intended to be read
locklessly.  Readers must take care to avoid observing changes
that appear incoherent.  This was done so that node_demotion[]
locking has no chance of becoming a bottleneck on large systems
with lots of CPUs in direct reclaim.

This code is unused for now.  It will be called later in the
series.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>

--

Changes from 20200122:
 * Add big node_demotion[] comment
Changes from 20210302:
 * Fix typo in node_demotion[] comment
---

 b/mm/internal.h   |    5 +
 b/mm/migrate.c    |  175 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 b/mm/page_alloc.c |    2 
 3 files changed, 180 insertions(+), 2 deletions(-)

Comments

Oscar Salvador April 8, 2021, 8:26 a.m. UTC | #1
On Thu, Apr 01, 2021 at 11:32:19AM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> When memory fills up on a node, memory contents can be
> automatically migrated to another node.  The biggest problems are
> knowing when to migrate and to where the migration should be
> targeted.
> 
> The most straightforward way to generate the "to where" list would
> be to follow the page allocator fallback lists.  Those lists
> already tell us if memory is full where to look next.  It would
> also be logical to move memory in that order.
> 
> But, the allocator fallback lists have a fatal flaw: most nodes
> appear in all the lists.  This would potentially lead to migration
> cycles (A->B, B->A, A->B, ...).
> 
> Instead of using the allocator fallback lists directly, keep a
> separate node migration ordering.  But, reuse the same data used
> to generate page allocator fallback in the first place:
> find_next_best_node().
> 
> This means that the firmware data used to populate node distances
> essentially dictates the ordering for now.  It should also be
> architecture-neutral since all NUMA architectures have a working
> find_next_best_node().
> 
> The protocol for node_demotion[] access and writing is not
> standard.  It has no specific locking and is intended to be read
> locklessly.  Readers must take care to avoid observing changes
> that appear incoherent.  This was done so that node_demotion[]

It might be just me being dense here, but that reads odd.

"Readers must take care to avoid observing changes that appear
incoherent" - I am not sure what is that supposed to mean.

I guess you mean readers of next_demotion_node()?
And if so, how do they have to take care? And what would apply for
"incoherent" terminology here?

> locking has no chance of becoming a bottleneck on large systems
> with lots of CPUs in direct reclaim.
> 
> This code is unused for now.  It will be called later in the
> series.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>

...

> +static void __set_migration_target_nodes(void)
> +{
> +	nodemask_t next_pass	= NODE_MASK_NONE;
> +	nodemask_t this_pass	= NODE_MASK_NONE;
> +	nodemask_t used_targets = NODE_MASK_NONE;
> +	int node;
> +
> +	/*
> +	 * Avoid any oddities like cycles that could occur
> +	 * from changes in the topology.  This will leave
> +	 * a momentary gap when migration is disabled.
> +	 */
> +	disable_all_migrate_targets();
> +
> +	/*
> +	 * Ensure that the "disable" is visible across the system.
> +	 * Readers will see either a combination of before+disable
> +	 * state or disable+after.  They will never see before and
> +	 * after state together.
> +	 *
> +	 * The before+after state together might have cycles and
> +	 * could cause readers to do things like loop until this
> +	 * function finishes.  This ensures they can only see a
> +	 * single "bad" read and would, for instance, only loop
> +	 * once.
> +	 */
> +	smp_wmb();
> +
> +	/*
> +	 * Allocations go close to CPUs, first.  Assume that
> +	 * the migration path starts at the nodes with CPUs.
> +	 */
> +	next_pass = node_states[N_CPU];
> +again:
> +	this_pass = next_pass;
> +	next_pass = NODE_MASK_NONE;
> +	/*
> +	 * To avoid cycles in the migration "graph", ensure
> +	 * that migration sources are not future targets by
> +	 * setting them in 'used_targets'.  Do this only
> +	 * once per pass so that multiple source nodes can
> +	 * share a target node.
> +	 *
> +	 * 'used_targets' will become unavailable in future
> +	 * passes.  This limits some opportunities for
> +	 * multiple source nodes to share a destination.
> +	 */
> +	nodes_or(used_targets, used_targets, this_pass);
> +	for_each_node_mask(node, this_pass) {
> +		int target_node = establish_migrate_target(node, &used_targets);
> +
> +		if (target_node == NUMA_NO_NODE)
> +			continue;
> +
> +		/* Visit targets from this pass in the next pass: */
> +		node_set(target_node, next_pass);
> +	}
> +	/* Is another pass necessary? */
> +	if (!nodes_empty(next_pass))

When I read this I was about puzzled and it took me a while to figure
out how the passes were made.
I think this could benefit from a better explanation on how the passes
are being performed e.g: why next_pass should be empty before leaving.

Other than that looks good to me.
Dave Hansen April 8, 2021, 9:51 p.m. UTC | #2
On 4/8/21 1:26 AM, Oscar Salvador wrote:
> On Thu, Apr 01, 2021 at 11:32:19AM -0700, Dave Hansen wrote:
>> The protocol for node_demotion[] access and writing is not
>> standard.  It has no specific locking and is intended to be read
>> locklessly.  Readers must take care to avoid observing changes
>> that appear incoherent.  This was done so that node_demotion[]
> 
> It might be just me being dense here, but that reads odd.
> 
> "Readers must take care to avoid observing changes that appear
> incoherent" - I am not sure what is that supposed to mean.
> 
> I guess you mean readers of next_demotion_node()?
> And if so, how do they have to take care? And what would apply for
> "incoherent" terminology here?

I've fleshed out the description a bit.  I hope this helps?

> Readers of node_demotion[] (such as next_demotion_node() callers)
> must take care to avoid observing changes that appear incoherent.
> For instance, even though no demotion cycles are allowed, it's
> possible that a cycle could be observed.
> 
> Let's say that there are three nodes, A, B and C.  node_demotion[]
> is set up to have this path:
>         
>         A -> B -> C
> 
> Suppose it was modified to instead represent this path:
> 
>         A -> C -> B
> 
> There is nothing to stop a reader from seeing B->C and then a
> moment later seeting C->B.  That *appears* to be a cycle.  This
> can be avoided with RCU and will be implemented in a later patch.

...
>> +again:
>> +	this_pass = next_pass;
>> +	next_pass = NODE_MASK_NONE;
>> +	/*
>> +	 * To avoid cycles in the migration "graph", ensure
>> +	 * that migration sources are not future targets by
>> +	 * setting them in 'used_targets'.  Do this only
>> +	 * once per pass so that multiple source nodes can
>> +	 * share a target node.
>> +	 *
>> +	 * 'used_targets' will become unavailable in future
>> +	 * passes.  This limits some opportunities for
>> +	 * multiple source nodes to share a destination.
>> +	 */
>> +	nodes_or(used_targets, used_targets, this_pass);
>> +	for_each_node_mask(node, this_pass) {
>> +		int target_node = establish_migrate_target(node, &used_targets);
>> +
>> +		if (target_node == NUMA_NO_NODE)
>> +			continue;
>> +
>> +		/* Visit targets from this pass in the next pass: */
>> +		node_set(target_node, next_pass);
>> +	}
>> +	/* Is another pass necessary? */
>> +	if (!nodes_empty(next_pass))
> 
> When I read this I was about puzzled and it took me a while to figure
> out how the passes were made.
> I think this could benefit from a better explanation on how the passes
> are being performed e.g: why next_pass should be empty before leaving.
> 
> Other than that looks good to me.
I've tried to flesh out those comments to elaborate on what is going on:

>                 /*
>                  * Visit targets from this pass in the next pass.
>                  * Eventually, every node will have been part of
>                  * a pass, and will become set in 'used_targets'.
>                  */
>                 node_set(target_node, next_pass);
>         }
>         /*
>          * 'next_pass' contains nodes which became migration
>          * targets in this pass.  Make additional passes until
>          * no more migrations targets are available.
>          */
>         if (!nodes_empty(next_pass))
>                 goto again;
> }
Oscar Salvador April 9, 2021, 8:17 a.m. UTC | #3
On Thu, Apr 08, 2021 at 02:51:20PM -0700, Dave Hansen wrote:
> I've fleshed out the description a bit.  I hope this helps?

Yes, thanks Dave, both additions look fine to me.
Wei Xu April 10, 2021, 3:07 a.m. UTC | #4
On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> +/*
> + * node_demotion[] example:
> + *
> + * Consider a system with two sockets.  Each socket has
> + * three classes of memory attached: fast, medium and slow.
> + * Each memory class is placed in its own NUMA node.  The
> + * CPUs are placed in the node with the "fast" memory.  The
> + * 6 NUMA nodes (0-5) might be split among the sockets like
> + * this:
> + *
> + *     Socket A: 0, 1, 2
> + *     Socket B: 3, 4, 5
> + *
> + * When Node 0 fills up, its memory should be migrated to
> + * Node 1.  When Node 1 fills up, it should be migrated to
> + * Node 2.  The migration path start on the nodes with the
> + * processors (since allocations default to this node) and
> + * fast memory, progress through medium and end with the
> + * slow memory:
> + *
> + *     0 -> 1 -> 2 -> stop
> + *     3 -> 4 -> 5 -> stop
> + *
> + * This is represented in the node_demotion[] like this:
> + *
> + *     {  1, // Node 0 migrates to 1
> + *        2, // Node 1 migrates to 2
> + *       -1, // Node 2 does not migrate
> + *        4, // Node 3 migrates to 4
> + *        5, // Node 4 migrates to 5
> + *       -1} // Node 5 does not migrate
> + */

In this example, if we want to support multiple nodes as the demotion
target of a source node, we can group these nodes into three tiers
(classes):

fast class:
0 -> {1, 4}  // 1 is the preferred
3 -> {4, 1}  // 4 is the preferred

medium class:
1 -> {2, 5}  // 2 is the preferred
4 -> {5, 2}  // 5 is the preferred

slow class:
2 -> stop
5 -> stop

This can guarantee there are no cycles, either.  Does it sound sensible?

> +again:
> +       this_pass = next_pass;
> +       next_pass = NODE_MASK_NONE;
> +       /*
> +        * To avoid cycles in the migration "graph", ensure
> +        * that migration sources are not future targets by
> +        * setting them in 'used_targets'.  Do this only
> +        * once per pass so that multiple source nodes can
> +        * share a target node.
> +        *
> +        * 'used_targets' will become unavailable in future
> +        * passes.  This limits some opportunities for
> +        * multiple source nodes to share a destination.
> +        */
> +       nodes_or(used_targets, used_targets, this_pass);
> +       for_each_node_mask(node, this_pass) {
> +               int target_node = establish_migrate_target(node, &used_targets);
> +
> +               if (target_node == NUMA_NO_NODE)
> +                       continue;
> +
> +               /* Visit targets from this pass in the next pass: */
> +               node_set(target_node, next_pass);
> +       }
> +       /* Is another pass necessary? */
> +       if (!nodes_empty(next_pass))
> +               goto again;

This goto seems like exactly a "do {} while" loop.  Any particular reason not to
use "do {} while" here?
Oscar Salvador April 14, 2021, 8:08 a.m. UTC | #5
On Fri, Apr 09, 2021 at 08:07:08PM -0700, Wei Xu wrote:
> On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> > + * When Node 0 fills up, its memory should be migrated to
> > + * Node 1.  When Node 1 fills up, it should be migrated to
> > + * Node 2.  The migration path start on the nodes with the
> > + * processors (since allocations default to this node) and
> > + * fast memory, progress through medium and end with the
> > + * slow memory:
> > + *
> > + *     0 -> 1 -> 2 -> stop
> > + *     3 -> 4 -> 5 -> stop
> > + *
> > + * This is represented in the node_demotion[] like this:
> > + *
> > + *     {  1, // Node 0 migrates to 1
> > + *        2, // Node 1 migrates to 2
> > + *       -1, // Node 2 does not migrate
> > + *        4, // Node 3 migrates to 4
> > + *        5, // Node 4 migrates to 5
> > + *       -1} // Node 5 does not migrate
> > + */
> 
> In this example, if we want to support multiple nodes as the demotion
> target of a source node, we can group these nodes into three tiers
> (classes):
> 
> fast class:
> 0 -> {1, 4}  // 1 is the preferred
> 3 -> {4, 1}  // 4 is the preferred
> 
> medium class:
> 1 -> {2, 5}  // 2 is the preferred
> 4 -> {5, 2}  // 5 is the preferred
> 
> slow class:
> 2 -> stop
> 5 -> stop

Hi Wei Xu,

I have some questions about it

Fast class/memory are pictured as those nodes with CPUs, while Slow class/memory
are PMEM, right?
Then, what stands for medium class/memory?

In Dave's example, list is created in a way that stays local to the socket,
and we go from the fast one to the slow one.
In yours, lists are created taking the fastest nodes from all sockets and
we work our way down, which means have cross-socket nodes in the list.
How much of a penalty is that?

And while I get your point, I am not sure if that is what we pretend here.
This patchset aims to place cold pages that are about to be reclaim in slower
nodes to give them a second chance, while your design seems more to have kind
of different memory clases and be able to place applications in one of those tiers
depending on its demands or sysadmin-demand.

Could you expand some more?
Oscar Salvador April 14, 2021, 8:11 a.m. UTC | #6
On Wed, Apr 14, 2021 at 10:08:54AM +0200, Oscar Salvador wrote:
> In Dave's example, list is created in a way that stays local to the socket,
> and we go from the fast one to the slow one.

Or maybe it is just because find_next_best_node() does not know any better
and creates the list that way?
David Hildenbrand April 14, 2021, 8:12 a.m. UTC | #7
On 14.04.21 10:08, Oscar Salvador wrote:
> On Fri, Apr 09, 2021 at 08:07:08PM -0700, Wei Xu wrote:
>> On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>>> + * When Node 0 fills up, its memory should be migrated to
>>> + * Node 1.  When Node 1 fills up, it should be migrated to
>>> + * Node 2.  The migration path start on the nodes with the
>>> + * processors (since allocations default to this node) and
>>> + * fast memory, progress through medium and end with the
>>> + * slow memory:
>>> + *
>>> + *     0 -> 1 -> 2 -> stop
>>> + *     3 -> 4 -> 5 -> stop
>>> + *
>>> + * This is represented in the node_demotion[] like this:
>>> + *
>>> + *     {  1, // Node 0 migrates to 1
>>> + *        2, // Node 1 migrates to 2
>>> + *       -1, // Node 2 does not migrate
>>> + *        4, // Node 3 migrates to 4
>>> + *        5, // Node 4 migrates to 5
>>> + *       -1} // Node 5 does not migrate
>>> + */
>>
>> In this example, if we want to support multiple nodes as the demotion
>> target of a source node, we can group these nodes into three tiers
>> (classes):
>>
>> fast class:
>> 0 -> {1, 4}  // 1 is the preferred
>> 3 -> {4, 1}  // 4 is the preferred
>>
>> medium class:
>> 1 -> {2, 5}  // 2 is the preferred
>> 4 -> {5, 2}  // 5 is the preferred
>>
>> slow class:
>> 2 -> stop
>> 5 -> stop
> 
> Hi Wei Xu,
> 
> I have some questions about it
> 
> Fast class/memory are pictured as those nodes with CPUs, while Slow class/memory
> are PMEM, right?
> Then, what stands for medium class/memory?

My guest best is that fast class is something like HBM (High Bandwidth 
Memory), medium class is ordinary RAM, slow class is PMEM.
Oscar Salvador April 14, 2021, 8:14 a.m. UTC | #8
On Wed, Apr 14, 2021 at 10:12:29AM +0200, David Hildenbrand wrote:
> My guest best is that fast class is something like HBM (High Bandwidth
> Memory), medium class is ordinary RAM, slow class is PMEM.

I see, thanks for the hint David ;-)
David Hildenbrand April 14, 2021, 8:20 a.m. UTC | #9
On 14.04.21 10:14, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 10:12:29AM +0200, David Hildenbrand wrote:
>> My guest best is that fast class is something like HBM (High Bandwidth

haha, whatever happened in that sentence: s/guest best/best guess/

>> Memory), medium class is ordinary RAM, slow class is PMEM.
> 
> I see, thanks for the hint David ;-)
>
Wei Xu April 15, 2021, 4:07 a.m. UTC | #10
On Wed, Apr 14, 2021 at 1:08 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> Hi Wei Xu,
>
> I have some questions about it
>
> Fast class/memory are pictured as those nodes with CPUs, while Slow class/memory
> are PMEM, right?
> Then, what stands for medium class/memory?

That is Dave's example.  I think David's guess makes sense (HBM - fast, DRAM -
medium, PMEM - slow).  It may also be possible that we have DDR5 as fast,
CXL-DDR4 as medium, and CXL-PMEM as slow.  But the most likely use cases for
now should be just two tiers: DRAM vs PMEM or other types of slower
memory devices.

> In Dave's example, list is created in a way that stays local to the socket,
> and we go from the fast one to the slow one.
> In yours, lists are created taking the fastest nodes from all sockets and
> we work our way down, which means have cross-socket nodes in the list.
> How much of a penalty is that?

Cross-socket demotion is certainly more expensive.  But because it is
sequential access
and can also be optimized with non-temporal stores, it may not be much
slower than
demotion to a local node in the next tier.  The actual penalty will
depend on the devices.

> And while I get your point, I am not sure if that is what we pretend here.
> This patchset aims to place cold pages that are about to be reclaim in slower
> nodes to give them a second chance, while your design seems more to have kind
> of different memory clases and be able to place applications in one of those tiers
> depending on its demands or sysadmin-demand.
>
> Could you expand some more?

Sure.  What I have described has the same goal as Dave's patchset,
i,e, to demote
cold pages to the slower nodes when they are about to be reclaimed.  The only
difference is that in my suggestion the demotion target of a fast tier
node is expanded
from a single node to a set of nodes from the slow tier and one node
in such a set
can be marked as the preferred/local demotion target.   This can help
enable more
flexible demotion policies to be configured, such as to allow a cgroup
to allocate from
all fast tier nodes, but only demote to a local slow tier node.  Such
a policy can reduce
memory stranding at the fast tier (compared to if memory hardwall is
used) and still
allow demotion from all fast tier nodes without incurring the expensive random
accesses to the demoted pages if they were demoted to remote slow tier nodes.

I understand that Dave started this patchset with a simplified
demotion path definition,
which I agree.  Meanwhile, I think this more generalized definition of
demotion path
is useful and can also be important for some use cases.
Dave Hansen April 15, 2021, 3:35 p.m. UTC | #11
On 4/14/21 9:07 PM, Wei Xu wrote:
> On Wed, Apr 14, 2021 at 1:08 AM Oscar Salvador <osalvador@suse.de> wrote:
>> Fast class/memory are pictured as those nodes with CPUs, while Slow class/memory
>> are PMEM, right?
>> Then, what stands for medium class/memory?
> 
> That is Dave's example.  I think David's guess makes sense (HBM - fast, DRAM -
> medium, PMEM - slow).  It may also be possible that we have DDR5 as fast,
> CXL-DDR4 as medium, and CXL-PMEM as slow.  But the most likely use cases for
> now should be just two tiers: DRAM vs PMEM or other types of slower
> memory devices.

Yes, it would be nice to apply this to fancier tiering systems.  But
DRAM/PMEM combos are out in the wild today and it's where I expect this
to be used first.

> This can help enable more flexible demotion policies to be
> configured, such as to allow a cgroup to allocate from all fast tier
> nodes, but only demote to a local slow tier node.  Such a policy can
> reduce memory stranding at the fast tier (compared to if memory
> hardwall is used) and still allow demotion from all fast tier nodes
> without incurring the expensive random accesses to the demoted pages
> if they were demoted to remote slow tier nodes.

Could you explain this stranding effect in a bit more detail?  I'm not
quite following.
Wei Xu April 15, 2021, 8:25 p.m. UTC | #12
On Thu, Apr 15, 2021 at 8:35 AM Dave Hansen <dave.hansen@intel.com> wrote:
> > This can help enable more flexible demotion policies to be
> > configured, such as to allow a cgroup to allocate from all fast tier
> > nodes, but only demote to a local slow tier node.  Such a policy can
> > reduce memory stranding at the fast tier (compared to if memory
> > hardwall is used) and still allow demotion from all fast tier nodes
> > without incurring the expensive random accesses to the demoted pages
> > if they were demoted to remote slow tier nodes.
>
> Could you explain this stranding effect in a bit more detail?  I'm not
> quite following.

By memory stranding, I mean that memory on a machine (or a NUMA node)
cannot be utilized even under extremely high work loads.  Memory
stranding happens usually due to mismatches between job/machine
shapes as well as resource fragmentation resulted from bin-packing
scheduling.  It is an important problem for cloud resource efficiency.

If NUMA hardwalling is used, we effectively split a single machine
into multiple smaller machines based on NUMA nodes.  This changes the
machine shapes and also makes memory more fragmented, which can lead
to more memory being stranded.

Here is a simple example:  Suppose that each machine has 2 NUMA nodes,
each with 4 cores and 5GB RAM, and all the jobs have the shape of
2 CPUs and 3GB memory.  Without NUMA memory hardwalling, we can pack
3 jobs onto each machine, which leaves 1GB memory and 2 cores in
stranding.  However, with NUMA memory hardwalling enabled, we can then
only pack 2 jobs onto each machine (one job on each NUMA node), which
increases the resource stranding to 4GB memory and 4 cores.

Patch
diff mbox series

diff -puN mm/internal.h~auto-setup-default-migration-path-from-firmware mm/internal.h
--- a/mm/internal.h~auto-setup-default-migration-path-from-firmware	2021-03-31 15:17:11.794000261 -0700
+++ b/mm/internal.h	2021-03-31 15:17:11.816000261 -0700
@@ -515,12 +515,17 @@  static inline void mminit_validate_memmo
 
 #ifdef CONFIG_NUMA
 extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
+extern int find_next_best_node(int node, nodemask_t *used_node_mask);
 #else
 static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
 				unsigned int order)
 {
 	return NODE_RECLAIM_NOSCAN;
 }
+static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
+{
+	return NUMA_NO_NODE;
+}
 #endif
 
 extern int hwpoison_filter(struct page *p);
diff -puN mm/migrate.c~auto-setup-default-migration-path-from-firmware mm/migrate.c
--- a/mm/migrate.c~auto-setup-default-migration-path-from-firmware	2021-03-31 15:17:11.798000261 -0700
+++ b/mm/migrate.c	2021-03-31 15:17:11.821000261 -0700
@@ -1163,6 +1163,44 @@  out:
 	return rc;
 }
 
+
+/*
+ * node_demotion[] example:
+ *
+ * Consider a system with two sockets.  Each socket has
+ * three classes of memory attached: fast, medium and slow.
+ * Each memory class is placed in its own NUMA node.  The
+ * CPUs are placed in the node with the "fast" memory.  The
+ * 6 NUMA nodes (0-5) might be split among the sockets like
+ * this:
+ *
+ *	Socket A: 0, 1, 2
+ *	Socket B: 3, 4, 5
+ *
+ * When Node 0 fills up, its memory should be migrated to
+ * Node 1.  When Node 1 fills up, it should be migrated to
+ * Node 2.  The migration path start on the nodes with the
+ * processors (since allocations default to this node) and
+ * fast memory, progress through medium and end with the
+ * slow memory:
+ *
+ *	0 -> 1 -> 2 -> stop
+ *	3 -> 4 -> 5 -> stop
+ *
+ * This is represented in the node_demotion[] like this:
+ *
+ *	{  1, // Node 0 migrates to 1
+ *	   2, // Node 1 migrates to 2
+ *	  -1, // Node 2 does not migrate
+ *	   4, // Node 3 migrates to 4
+ *	   5, // Node 4 migrates to 5
+ *	  -1} // Node 5 does not migrate
+ */
+
+/*
+ * Writes to this array occur without locking.  READ_ONCE()
+ * is recommended for readers to ensure consistent reads.
+ */
 static int node_demotion[MAX_NUMNODES] __read_mostly =
 	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
 
@@ -1177,7 +1215,13 @@  static int node_demotion[MAX_NUMNODES] _
  */
 int next_demotion_node(int node)
 {
-	return node_demotion[node];
+	/*
+	 * node_demotion[] is updated without excluding
+	 * this function from running.  READ_ONCE() avoids
+	 * reading multiple, inconsistent 'node' values
+	 * during an update.
+	 */
+	return READ_ONCE(node_demotion[node]);
 }
 
 /*
@@ -3181,3 +3225,132 @@  void migrate_vma_finalize(struct migrate
 }
 EXPORT_SYMBOL(migrate_vma_finalize);
 #endif /* CONFIG_DEVICE_PRIVATE */
+
+/* Disable reclaim-based migration. */
+static void disable_all_migrate_targets(void)
+{
+	int node;
+
+	for_each_online_node(node)
+		node_demotion[node] = NUMA_NO_NODE;
+}
+
+/*
+ * Find an automatic demotion target for 'node'.
+ * Failing here is OK.  It might just indicate
+ * being at the end of a chain.
+ */
+static int establish_migrate_target(int node, nodemask_t *used)
+{
+	int migration_target;
+
+	/*
+	 * Can not set a migration target on a
+	 * node with it already set.
+	 *
+	 * No need for READ_ONCE() here since this
+	 * in the write path for node_demotion[].
+	 * This should be the only thread writing.
+	 */
+	if (node_demotion[node] != NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	migration_target = find_next_best_node(node, used);
+	if (migration_target == NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	node_demotion[node] = migration_target;
+
+	return migration_target;
+}
+
+/*
+ * When memory fills up on a node, memory contents can be
+ * automatically migrated to another node instead of
+ * discarded at reclaim.
+ *
+ * Establish a "migration path" which will start at nodes
+ * with CPUs and will follow the priorities used to build the
+ * page allocator zonelists.
+ *
+ * The difference here is that cycles must be avoided.  If
+ * node0 migrates to node1, then neither node1, nor anything
+ * node1 migrates to can migrate to node0.
+ *
+ * This function can run simultaneously with readers of
+ * node_demotion[].  However, it can not run simultaneously
+ * with itself.  Exclusion is provided by memory hotplug events
+ * being single-threaded.
+ */
+static void __set_migration_target_nodes(void)
+{
+	nodemask_t next_pass	= NODE_MASK_NONE;
+	nodemask_t this_pass	= NODE_MASK_NONE;
+	nodemask_t used_targets = NODE_MASK_NONE;
+	int node;
+
+	/*
+	 * Avoid any oddities like cycles that could occur
+	 * from changes in the topology.  This will leave
+	 * a momentary gap when migration is disabled.
+	 */
+	disable_all_migrate_targets();
+
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 *
+	 * The before+after state together might have cycles and
+	 * could cause readers to do things like loop until this
+	 * function finishes.  This ensures they can only see a
+	 * single "bad" read and would, for instance, only loop
+	 * once.
+	 */
+	smp_wmb();
+
+	/*
+	 * Allocations go close to CPUs, first.  Assume that
+	 * the migration path starts at the nodes with CPUs.
+	 */
+	next_pass = node_states[N_CPU];
+again:
+	this_pass = next_pass;
+	next_pass = NODE_MASK_NONE;
+	/*
+	 * To avoid cycles in the migration "graph", ensure
+	 * that migration sources are not future targets by
+	 * setting them in 'used_targets'.  Do this only
+	 * once per pass so that multiple source nodes can
+	 * share a target node.
+	 *
+	 * 'used_targets' will become unavailable in future
+	 * passes.  This limits some opportunities for
+	 * multiple source nodes to share a destination.
+	 */
+	nodes_or(used_targets, used_targets, this_pass);
+	for_each_node_mask(node, this_pass) {
+		int target_node = establish_migrate_target(node, &used_targets);
+
+		if (target_node == NUMA_NO_NODE)
+			continue;
+
+		/* Visit targets from this pass in the next pass: */
+		node_set(target_node, next_pass);
+	}
+	/* Is another pass necessary? */
+	if (!nodes_empty(next_pass))
+		goto again;
+}
+
+/*
+ * For callers that do not hold get_online_mems() already.
+ */
+__maybe_unused // <- temporay to prevent warnings during bisects
+static void set_migration_target_nodes(void)
+{
+	get_online_mems();
+	__set_migration_target_nodes();
+	put_online_mems();
+}
diff -puN mm/page_alloc.c~auto-setup-default-migration-path-from-firmware mm/page_alloc.c
--- a/mm/page_alloc.c~auto-setup-default-migration-path-from-firmware	2021-03-31 15:17:11.811000261 -0700
+++ b/mm/page_alloc.c	2021-03-31 15:17:11.826000261 -0700
@@ -5780,7 +5780,7 @@  static int node_load[MAX_NUMNODES];
  *
  * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
  */
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+int find_next_best_node(int node, nodemask_t *used_node_mask)
 {
 	int n, val;
 	int min_val = INT_MAX;