From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=MVzO=AL=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.5 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS,
	HTML_MESSAGE,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,
	UNPARSEABLE_RELAY,URIBL_BLOCKED,USER_AGENT_SANE_1 autolearn=no
	autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id B1BA2C433E0
	for <linux-mm@archiver.kernel.org>; Tue, 30 Jun 2020 17:50:56 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 1C94B20768
	for <linux-mm@archiver.kernel.org>; Tue, 30 Jun 2020 17:50:56 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 1C94B20768
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 8F46F6B0002; Tue, 30 Jun 2020 13:50:55 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 8A4DB6B0003; Tue, 30 Jun 2020 13:50:55 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 76B956B0006; Tue, 30 Jun 2020 13:50:55 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0246.hostedemail.com [216.40.44.246])
	by kanga.kvack.org (Postfix) with ESMTP id 5D1306B0002
	for <linux-mm@kvack.org>; Tue, 30 Jun 2020 13:50:55 -0400 (EDT)
Received: from smtpin27.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay05.hostedemail.com (Postfix) with ESMTP id BD439181AC9CB
	for <linux-mm@kvack.org>; Tue, 30 Jun 2020 17:50:54 +0000 (UTC)
X-FDA: 76986618828.27.girls81_1904f6b26e79
Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251])
	by smtpin27.hostedemail.com (Postfix) with ESMTP id 7E9403D663
	for <linux-mm@kvack.org>; Tue, 30 Jun 2020 17:50:54 +0000 (UTC)
X-HE-Tag: girls81_1904f6b26e79
X-Filterd-Recvd-Size: 24623
Received: from out30-43.freemail.mail.aliyun.com (out30-43.freemail.mail.aliyun.com [115.124.30.43])
	by imf07.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Tue, 30 Jun 2020 17:50:52 +0000 (UTC)
X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R181e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04394;MF=yang.shi@linux.alibaba.com;NM=1;PH=DS;RN=6;SR=0;TI=SMTPD_---0U1CLbo3_1593539445;
Received: from US-143344MP.local(mailfrom:yang.shi@linux.alibaba.com fp:SMTPD_---0U1CLbo3_1593539445)
          by smtp.aliyun-inc.com(127.0.0.1);
          Wed, 01 Jul 2020 01:50:47 +0800
Subject: Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable
 reclaim-based migration
To: "Huang, Ying" <ying.huang@intel.com>,
 Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, rientjes@google.com,
 dan.j.williams@intel.com
References: <20200629234503.749E5340@viggo.jf.intel.com>
 <20200629234517.A7EC4BD3@viggo.jf.intel.com>
 <87v9j9ow3a.fsf@yhuang-dev.intel.com>
From: Yang Shi <yang.shi@linux.alibaba.com>
Message-ID: <29c67873-3cb9-e121-382c-9b81491016bc@linux.alibaba.com>
Date: Tue, 30 Jun 2020 10:50:30 -0700
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0)
 Gecko/20100101 Thunderbird/52.7.0
MIME-Version: 1.0
In-Reply-To: <87v9j9ow3a.fsf@yhuang-dev.intel.com>
Content-Type: multipart/alternative;
 boundary="------------5429574D218099344F0AC18B"
Content-Language: en-US
X-Rspamd-Queue-Id: 7E9403D663
X-Spamd-Result: default: False [0.00 / 100.00]
X-Rspamd-Server: rspam03
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

This is a multi-part message in MIME format.
--------------5429574D218099344F0AC18B
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit


On 6/30/20 12:23 AM, Huang, Ying wrote:
> Hi, Dave,
>
> Dave Hansen <dave.hansen@linux.intel.com> writes:
>
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> Some method is obviously needed to enable reclaim-based migration.
>>
>> Just like traditional autonuma, there will be some workloads that
>> will benefit like workloads with more "static" configurations where
>> hot pages stay hot and cold pages stay cold.  If pages come and go
>> from the hot and cold sets, the benefits of this approach will be
>> more limited.
>>
>> The benefits are truly workload-based and *not* hardware-based.
>> We do not believe that there is a viable threshold where certain
>> hardware configurations should have this mechanism enabled while
>> others do not.
>>
>> To be conservative, earlier work defaulted to disable reclaim-
>> based migration and did not include a mechanism to enable it.
>> This proposes extending the existing "zone_reclaim_mode" (now
>> really node_reclaim_mode) as a method to enable it.
>>
>> We are open to any alternative that allows end users to enable
>> this mechanism or disable it if workload harm is detected (just
>> like traditional autonuma).
>>
>> The implementation here is pretty simple and entirely unoptimized.
>> On any memory hotplug events, assume that a node was added or
>> removed and recalculate all migration targets.  This ensures that
>> the node_demotion[] array is always ready to be used in case the
>> new reclaim mode is enabled.  This recalculation is far from
>> optimal, most glaringly that it does not even attempt to figure
>> out if nodes are actually coming or going.
>>
>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Yang Shi <yang.shi@linux.alibaba.com>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> ---
>>
>>   b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
>>   b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
>>   b/mm/vmscan.c                             |    7 +--
>>   3 files changed, 73 insertions(+), 4 deletions(-)
>>
>> diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
>> --- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
>> +++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
>> @@ -941,6 +941,7 @@ This is value OR'ed together of
>>   1	(bit currently ignored)
>>   2	Zone reclaim writes dirty pages out
>>   4	Zone reclaim swaps pages
>> +8	Zone reclaim migrates pages
>>   =	===================================
>>   
>>   zone_reclaim_mode is disabled by default.  For file servers or workloads
>> @@ -965,3 +966,11 @@ of other processes running on other node
>>   Allowing regular swap effectively restricts allocations to the local
>>   node unless explicitly overridden by memory policies or cpuset
>>   configurations.
>> +
>> +Page migration during reclaim is intended for systems with tiered memory
>> +configurations.  These systems have multiple types of memory with varied
>> +performance characteristics instead of plain NUMA systems where the same
>> +kind of memory is found at varied distances.  Allowing page migration
>> +during reclaim enables these systems to migrate pages from fast tiers to
>> +slow tiers when the fast tier is under pressure.  This migration is
>> +performed before swap.
>> diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
>> --- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
>> +++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
>> @@ -49,6 +49,7 @@
>>   #include <linux/sched/mm.h>
>>   #include <linux/ptrace.h>
>>   #include <linux/oom.h>
>> +#include <linux/memory.h>
>>   
>>   #include <asm/tlbflush.h>
>>   
>> @@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
>>   	 * Avoid any oddities like cycles that could occur
>>   	 * from changes in the topology.  This will leave
>>   	 * a momentary gap when migration is disabled.
>> +	 *
>> +	 * This is superfluous for memory offlining since
>> +	 * MEM_GOING_OFFLINE does it independently, but it
>> +	 * does not hurt to do it a second time.
>>   	 */
>>   	disable_all_migrate_targets();
>>   
>> @@ -3211,6 +3216,60 @@ again:
>>   	/* Is another pass necessary? */
>>   	if (!nodes_empty(next_pass))
>>   		goto again;
>> +}
>>   
>> -	put_online_mems();
>> +/*
>> + * React to hotplug events that might online or offline
>> + * NUMA nodes.
>> + *
>> + * This leaves migrate-on-reclaim transiently disabled
>> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
>> + * This runs whether RECLAIM_MIGRATE is enabled or not.
>> + * That ensures that the user can turn RECLAIM_MIGRATE
>> + * without needing to recalcuate migration targets.
>> + */
>> +#if defined(CONFIG_MEMORY_HOTPLUG)
>> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>> +						 unsigned long action, void *arg)
>> +{
>> +	switch (action) {
>> +	case MEM_GOING_OFFLINE:
>> +		/*
>> +		 * Make sure there are not transient states where
>> +		 * an offline node is a migration target.  This
>> +		 * will leave migration disabled until the offline
>> +		 * completes and the MEM_OFFLINE case below runs.
>> +		 */
>> +		disable_all_migrate_targets();
>> +		break;
>> +	case MEM_OFFLINE:
>> +	case MEM_ONLINE:
>> +		/*
>> +		 * Recalculate the target nodes once the node
>> +		 * reaches its final state (online or offline).
>> +		 */
>> +		set_migration_target_nodes();
>> +		break;
>> +	case MEM_CANCEL_OFFLINE:
>> +		/*
>> +		 * MEM_GOING_OFFLINE disabled all the migration
>> +		 * targets.  Reenable them.
>> +		 */
>> +		set_migration_target_nodes();
>> +		break;
>> +	case MEM_GOING_ONLINE:
>> +	case MEM_CANCEL_ONLINE:
>> +		break;
>> +	}
>> +
>> +	return notifier_from_errno(0);
>>   }
>> +
>> +static int __init migrate_on_reclaim_init(void)
>> +{
>> +	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
>> +	return 0;
>> +}
>> +late_initcall(migrate_on_reclaim_init);
>> +#endif /* CONFIG_MEMORY_HOTPLUG */
>> +
>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
>> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>    * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>    * ABI.  New bits are OK, but existing bits can never change.
>>    */
>> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
>> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
>> +#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
>> +#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
>> +#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
>> +#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
>>   
>>   /*
>>    * Priority for NODE_RECLAIM. This determines the fraction of pages
> I found that RECLAIM_MIGRATE is defined but never referenced in the
> patch.
>
> If my understanding of the code were correct, shrink_do_demote_mapping()
> is called by shrink_page_list(), which is used by kswapd and direct
> reclaim.  So as long as the persistent memory node is onlined,
> reclaim-based migration will be enabled regardless of node reclaim mode.

It does look that way from the code. But the intent of the new node
reclaim mode is to migrate on reclaim *only when* the user has enabled
RECLAIM_MIGRATE.

It looks like the patch only clears the migration target node masks when
memory is offlined.

So I suppose you need to check whether node reclaim is enabled before
migrating in shrink_page_list(), and also need to make node reclaim adopt
the new mode.

Please refer to 
https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/

I copied the related chunks here:

+				if (is_demote_ok(page_to_nid(page))) { <--- check if node reclaim is enabled
+					list_add(&page->lru, &demote_pages);
+					unlock_page(page);
+					continue;
+				}

and

@@ -4084,8 +4179,10 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.gfp_mask = current_gfp_context(gfp_mask),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
-		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
+		.may_writepage = !!((node_reclaim_mode & RECLAIM_WRITE) ||
+				    (node_reclaim_mode & RECLAIM_MIGRATE)),
+		.may_unmap = !!((node_reclaim_mode & RECLAIM_UNMAP) ||
+				(node_reclaim_mode & RECLAIM_MIGRATE)),
 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
 	};
@@ -4105,7 +4202,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
+	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
+	    (node_reclaim_mode & RECLAIM_MIGRATE)) {
 		/*
 		 * Free memory by calling shrink node with increasing
 		 * priorities until we have enough memory freed.
@@ -4138,9 +4236,12 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	 * thrown out if the node is overallocated. So we do not reclaim
 	 * if less than a specified percentage of the node is used by
 	 * unmapped file backed pages.
+	 *
+	 * Migrate mode doesn't care the above restrictions.
 	 */
 	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
-	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages &&
+	    !(node_reclaim_mode & RECLAIM_MIGRATE))
 		return NODE_RECLAIM_FULL;
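
To make the intended gating concrete outside the kernel, here is a
minimal userspace sketch. It is only a model under stated assumptions:
the RECLAIM_* bit values are the ones from the quoted patch, while
struct scan_flags, demotion_enabled(), flags_for_mode() and
skip_node_reclaim() are hypothetical names made up for illustration.
They are not kernel APIs, and is_demote_ok() in the linked patch may
check more than the mode bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values as defined in the quoted patch (vm.zone_reclaim_mode ABI). */
#define RECLAIM_RSVD	(1 << 0)
#define RECLAIM_WRITE	(1 << 1)
#define RECLAIM_UNMAP	(1 << 2)
#define RECLAIM_MIGRATE	(1 << 3)

/* Hypothetical stand-in for the two scan_control fields touched above. */
struct scan_flags {
	bool may_writepage;
	bool may_unmap;
};

/* Hypothetical gate: demote pages only if the user set RECLAIM_MIGRATE. */
static bool demotion_enabled(int mode)
{
	return (mode & RECLAIM_MIGRATE) != 0;
}

/* Migrate mode implies both writepage and unmap, as in the hunk above. */
static struct scan_flags flags_for_mode(int mode)
{
	struct scan_flags f = {
		.may_writepage = (mode & (RECLAIM_WRITE | RECLAIM_MIGRATE)) != 0,
		.may_unmap = (mode & (RECLAIM_UNMAP | RECLAIM_MIGRATE)) != 0,
	};
	return f;
}

/*
 * Model of the node_reclaim() early return: skip reclaim when both
 * thresholds say there is little to gain, unless migrate mode is set.
 */
static bool skip_node_reclaim(bool below_unmapped_min, bool below_slab_min,
			      int mode)
{
	return below_unmapped_min && below_slab_min &&
	       !(mode & RECLAIM_MIGRATE);
}
```

With this model, a mode of 8 (RECLAIM_MIGRATE alone) enables demotion,
sets both may_writepage and may_unmap, and never takes the early
return, while a mode of 0 leaves demotion off.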

>
> Best Regards,
> Huang, Ying


--------------5429574D218099344F0AC18B
Content-Type: text/html; charset=windows-1252
Content-Transfer-Encoding: 7bit

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;
      charset=windows-1252">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p><br>
    </p>
    <br>
    <div class="moz-cite-prefix">On 6/30/20 12:23 AM, Huang, Ying wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:87v9j9ow3a.fsf@yhuang-dev.intel.com">
      <pre wrap="">Hi, Dave,

Dave Hansen <a class="moz-txt-link-rfc2396E" href="mailto:dave.hansen@linux.intel.com">&lt;dave.hansen@linux.intel.com&gt;</a> writes:

</pre>
      <blockquote type="cite">
        <pre wrap="">From: Dave Hansen <a class="moz-txt-link-rfc2396E" href="mailto:dave.hansen@linux.intel.com">&lt;dave.hansen@linux.intel.com&gt;</a>

Some method is obviously needed to enable reclaim-based migration.

Just like traditional autonuma, there will be some workloads that
will benefit like workloads with more "static" configurations where
hot pages stay hot and cold pages stay cold.  If pages come and go
from the hot and cold sets, the benefits of this approach will be
more limited.

The benefits are truly workload-based and *not* hardware-based.
We do not believe that there is a viable threshold where certain
hardware configurations should have this mechanism enabled while
others do not.

To be conservative, earlier work defaulted to disable reclaim-
based migration and did not include a mechanism to enable it.
This proposes extending the existing "zone_reclaim_mode" (now
really node_reclaim_mode) as a method to enable it.

We are open to any alternative that allows end users to enable
this mechanism or disable it if workload harm is detected (just
like traditional autonuma).

The implementation here is pretty simple and entirely unoptimized.
On any memory hotplug events, assume that a node was added or
removed and recalculate all migration targets.  This ensures that
the node_demotion[] array is always ready to be used in case the
new reclaim mode is enabled.  This recalculation is far from
optimal, most glaringly that it does not even attempt to figure
out if nodes are actually coming or going.

Signed-off-by: Dave Hansen <a class="moz-txt-link-rfc2396E" href="mailto:dave.hansen@linux.intel.com">&lt;dave.hansen@linux.intel.com&gt;</a>
Cc: Yang Shi <a class="moz-txt-link-rfc2396E" href="mailto:yang.shi@linux.alibaba.com">&lt;yang.shi@linux.alibaba.com&gt;</a>
Cc: David Rientjes <a class="moz-txt-link-rfc2396E" href="mailto:rientjes@google.com">&lt;rientjes@google.com&gt;</a>
Cc: Huang Ying <a class="moz-txt-link-rfc2396E" href="mailto:ying.huang@intel.com">&lt;ying.huang@intel.com&gt;</a>
Cc: Dan Williams <a class="moz-txt-link-rfc2396E" href="mailto:dan.j.williams@intel.com">&lt;dan.j.williams@intel.com&gt;</a>
---

 b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
 b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
 b/mm/vmscan.c                             |    7 +--
 3 files changed, 73 insertions(+), 4 deletions(-)

diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
+++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
@@ -941,6 +941,7 @@ This is value OR'ed together of
 1	(bit currently ignored)
 2	Zone reclaim writes dirty pages out
 4	Zone reclaim swaps pages
+8	Zone reclaim migrates pages
 =	===================================
 
 zone_reclaim_mode is disabled by default.  For file servers or workloads
@@ -965,3 +966,11 @@ of other processes running on other node
 Allowing regular swap effectively restricts allocations to the local
 node unless explicitly overridden by memory policies or cpuset
 configurations.
+
+Page migration during reclaim is intended for systems with tiered memory
+configurations.  These systems have multiple types of memory with varied
+performance characteristics instead of plain NUMA systems where the same
+kind of memory is found at varied distances.  Allowing page migration
+during reclaim enables these systems to migrate pages from fast tiers to
+slow tiers when the fast tier is under pressure.  This migration is
+performed before swap.
diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
--- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
+++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
@@ -49,6 +49,7 @@
 #include &lt;linux/sched/mm.h&gt;
 #include &lt;linux/ptrace.h&gt;
 #include &lt;linux/oom.h&gt;
+#include &lt;linux/memory.h&gt;
 
 #include &lt;asm/tlbflush.h&gt;
 
@@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
 	 * Avoid any oddities like cycles that could occur
 	 * from changes in the topology.  This will leave
 	 * a momentary gap when migration is disabled.
+	 *
+	 * This is superfluous for memory offlining since
+	 * MEM_GOING_OFFLINE does it independently, but it
+	 * does not hurt to do it a second time.
 	 */
 	disable_all_migrate_targets();
 
@@ -3211,6 +3216,60 @@ again:
 	/* Is another pass necessary? */
 	if (!nodes_empty(next_pass))
 		goto again;
+}
 
-	put_online_mems();
+/*
+ * React to hotplug events that might online or offline
+ * NUMA nodes.
+ *
+ * This leaves migrate-on-reclaim transiently disabled
+ * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
+ * This runs whether RECLAIM_MIGRATE is enabled or not.
+ * That ensures that the user can turn RECLAIM_MIGRATE
+ * without needing to recalcuate migration targets.
+ */
+#if defined(CONFIG_MEMORY_HOTPLUG)
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+						 unsigned long action, void *arg)
+{
+	switch (action) {
+	case MEM_GOING_OFFLINE:
+		/*
+		 * Make sure there are not transient states where
+		 * an offline node is a migration target.  This
+		 * will leave migration disabled until the offline
+		 * completes and the MEM_OFFLINE case below runs.
+		 */
+		disable_all_migrate_targets();
+		break;
+	case MEM_OFFLINE:
+	case MEM_ONLINE:
+		/*
+		 * Recalculate the target nodes once the node
+		 * reaches its final state (online or offline).
+		 */
+		set_migration_target_nodes();
+		break;
+	case MEM_CANCEL_OFFLINE:
+		/*
+		 * MEM_GOING_OFFLINE disabled all the migration
+		 * targets.  Reenable them.
+		 */
+		set_migration_target_nodes();
+		break;
+	case MEM_GOING_ONLINE:
+	case MEM_CANCEL_ONLINE:
+		break;
+	}
+
+	return notifier_from_errno(0);
 }
+
+static int __init migrate_on_reclaim_init(void)
+{
+	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
+	return 0;
+}
+late_initcall(migrate_on_reclaim_init);
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
--- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
+++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
@@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
  * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
  * ABI.  New bits are OK, but existing bits can never change.
  */
-#define RECLAIM_RSVD  (1&lt;&lt;0)	/* (currently ignored/unused) */
-#define RECLAIM_WRITE (1&lt;&lt;1)	/* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1&lt;&lt;2)	/* Unmap pages during reclaim */
+#define RECLAIM_RSVD	(1&lt;&lt;0)	/* (currently ignored/unused) */
+#define RECLAIM_WRITE	(1&lt;&lt;1)	/* Writeout pages during reclaim */
+#define RECLAIM_UNMAP	(1&lt;&lt;2)	/* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE	(1&lt;&lt;3)	/* Migrate pages during reclaim */
 
 /*
  * Priority for NODE_RECLAIM. This determines the fraction of pages
</pre>
      </blockquote>
      <pre wrap="">
I found that RECLAIM_MIGRATE is defined but never referenced in the
patch.

If my understanding of the code were correct, shrink_do_demote_mapping()
is called by shrink_page_list(), which is used by kswapd and direct
reclaim.  So as long as the persistent memory node is onlined,
reclaim-based migration will be enabled regardless of node reclaim mode.</pre>
    </blockquote>
    <br>
    It does look that way from the code. But the intent of the new node
    reclaim mode is to migrate on reclaim *only when* the user has
    enabled RECLAIM_MIGRATE.<br>
    <br>
    It looks like the patch only clears the migration target node masks
    when memory is offlined.<br>
    <br>
    So I suppose you need to check whether node reclaim is enabled before
    migrating in shrink_page_list(), and also need to make node reclaim
    adopt the new mode.<br>
    <br>
    Please refer to <a
href="https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/">https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/</a><br>
    <br>
    I copied the related chunks here:<br>
    <br>
    <pre id="b" style="font-size: 13px; font-family: monospace; white-space: pre-wrap; color: rgb(0, 0, 0); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"><span class="add" style="font-size: 13px; font-family: monospace;">+				if (is_demote_ok(page_to_nid(page))) { &lt;--- check if node reclaim is enabled
+					list_add(&amp;page-&gt;lru, &amp;demote_pages);
+					unlock_page(page);
+					continue;
+				}

and

</span><span class="hunk" style="font-size: 13px; font-family: monospace;">@@ -4084,8 +4179,10 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
</span> 		.gfp_mask = current_gfp_context(gfp_mask),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
<span class="del" style="font-size: 13px; font-family: monospace;">-		.may_writepage = !!(node_reclaim_mode &amp; RECLAIM_WRITE),
-		.may_unmap = !!(node_reclaim_mode &amp; RECLAIM_UNMAP),
</span><span class="add" style="font-size: 13px; font-family: monospace;">+		.may_writepage = !!((node_reclaim_mode &amp; RECLAIM_WRITE) ||
+				    (node_reclaim_mode &amp; RECLAIM_MIGRATE)),
+		.may_unmap = !!((node_reclaim_mode &amp; RECLAIM_UNMAP) ||
+				(node_reclaim_mode &amp; RECLAIM_MIGRATE)),
</span> 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
 	};
<span class="hunk" style="font-size: 13px; font-family: monospace;">@@ -4105,7 +4202,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
</span> 	reclaim_state.reclaimed_slab = 0;
 	p-&gt;reclaim_state = &amp;reclaim_state;
 
<span class="del" style="font-size: 13px; font-family: monospace;">-	if (node_pagecache_reclaimable(pgdat) &gt; pgdat-&gt;min_unmapped_pages) {
</span><span class="add" style="font-size: 13px; font-family: monospace;">+	if (node_pagecache_reclaimable(pgdat) &gt; pgdat-&gt;min_unmapped_pages ||
+	    (node_reclaim_mode &amp; RECLAIM_MIGRATE)) {
</span> 		/*
 		 * Free memory by calling shrink node with increasing
 		 * priorities until we have enough memory freed.
<span class="hunk" style="font-size: 13px; font-family: monospace;">@@ -4138,9 +4236,12 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
</span> 	 * thrown out if the node is overallocated. So we do not reclaim
 	 * if less than a specified percentage of the node is used by
 	 * unmapped file backed pages.
<span class="add" style="font-size: 13px; font-family: monospace;">+	 *
+	 * Migrate mode doesn't care the above restrictions.
</span> 	 */
 	if (node_pagecache_reclaimable(pgdat) &lt;= pgdat-&gt;min_unmapped_pages &amp;&amp;
<span class="del" style="font-size: 13px; font-family: monospace;">-	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) &lt;= pgdat-&gt;min_slab_pages)
</span><span class="add" style="font-size: 13px; font-family: monospace;">+	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) &lt;= pgdat-&gt;min_slab_pages &amp;&amp;
+	    !(node_reclaim_mode &amp; RECLAIM_MIGRATE))
</span> 		return NODE_RECLAIM_FULL;

<span class="add" style="font-size: 13px; font-family: monospace;"></span></pre>
    <blockquote type="cite"
      cite="mid:87v9j9ow3a.fsf@yhuang-dev.intel.com">
      <pre wrap="">

Best Regards,
Huang, Ying
</pre>
    </blockquote>
    <br>
  </body>
</html>

--------------5429574D218099344F0AC18B--