Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration

From: Yang Shi <yang.shi@linux.alibaba.com>
To: "Huang, Ying" <ying.huang@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	rientjes@google.com, dan.j.williams@intel.com
Subject: Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration
Date: Tue, 30 Jun 2020 10:50:30 -0700
Message-ID: <29c67873-3cb9-e121-382c-9b81491016bc@linux.alibaba.com>
In-Reply-To: <87v9j9ow3a.fsf@yhuang-dev.intel.com>




On 6/30/20 12:23 AM, Huang, Ying wrote:
> Hi, Dave,
>
> Dave Hansen <dave.hansen@linux.intel.com> writes:
>
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> Some method is obviously needed to enable reclaim-based migration.
>>
>> Just like traditional autonuma, there will be some workloads that
>> will benefit like workloads with more "static" configurations where
>> hot pages stay hot and cold pages stay cold.  If pages come and go
>> from the hot and cold sets, the benefits of this approach will be
>> more limited.
>>
>> The benefits are truly workload-based and *not* hardware-based.
>> We do not believe that there is a viable threshold where certain
>> hardware configurations should have this mechanism enabled while
>> others do not.
>>
>> To be conservative, earlier work defaulted to disabling
>> reclaim-based migration and did not include a mechanism to enable
>> it.  This proposes extending the existing "zone_reclaim_mode" (now
>> really node_reclaim_mode) as a method to enable it.
>>
>> We are open to any alternative that allows end users to enable
>> this mechanism or disable it if workload harm is detected (just
>> like traditional autonuma).
>>
>> The implementation here is pretty simple and entirely unoptimized.
>> On any memory hotplug events, assume that a node was added or
>> removed and recalculate all migration targets.  This ensures that
>> the node_demotion[] array is always ready to be used in case the
>> new reclaim mode is enabled.  This recalculation is far from
>> optimal, most glaringly that it does not even attempt to figure
>> out if nodes are actually coming or going.
>>
>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Yang Shi <yang.shi@linux.alibaba.com>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> ---
>>
>>   b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
>>   b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
>>   b/mm/vmscan.c                             |    7 +--
>>   3 files changed, 73 insertions(+), 4 deletions(-)
>>
>> diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
>> --- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
>> +++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
>> @@ -941,6 +941,7 @@ This is value OR'ed together of
>>   1	(bit currently ignored)
>>   2	Zone reclaim writes dirty pages out
>>   4	Zone reclaim swaps pages
>> +8	Zone reclaim migrates pages
>>   =	===================================
>>   
>>   zone_reclaim_mode is disabled by default.  For file servers or workloads
>> @@ -965,3 +966,11 @@ of other processes running on other node
>>   Allowing regular swap effectively restricts allocations to the local
>>   node unless explicitly overridden by memory policies or cpuset
>>   configurations.
>> +
>> +Page migration during reclaim is intended for systems with tiered memory
>> +configurations.  These systems have multiple types of memory with varied
>> +performance characteristics instead of plain NUMA systems where the same
>> +kind of memory is found at varied distances.  Allowing page migration
>> +during reclaim enables these systems to migrate pages from fast tiers to
>> +slow tiers when the fast tier is under pressure.  This migration is
>> +performed before swap.
>> diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
>> --- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
>> +++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
>> @@ -49,6 +49,7 @@
>>   #include <linux/sched/mm.h>
>>   #include <linux/ptrace.h>
>>   #include <linux/oom.h>
>> +#include <linux/memory.h>
>>   
>>   #include <asm/tlbflush.h>
>>   
>> @@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
>>   	 * Avoid any oddities like cycles that could occur
>>   	 * from changes in the topology.  This will leave
>>   	 * a momentary gap when migration is disabled.
>> +	 *
>> +	 * This is superfluous for memory offlining since
>> +	 * MEM_GOING_OFFLINE does it independently, but it
>> +	 * does not hurt to do it a second time.
>>   	 */
>>   	disable_all_migrate_targets();
>>   
>> @@ -3211,6 +3216,60 @@ again:
>>   	/* Is another pass necessary? */
>>   	if (!nodes_empty(next_pass))
>>   		goto again;
>> +}
>>   
>> -	put_online_mems();
>> +/*
>> + * React to hotplug events that might online or offline
>> + * NUMA nodes.
>> + *
>> + * This leaves migrate-on-reclaim transiently disabled
>> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
>> + * This runs whether RECLAIM_MIGRATE is enabled or not.
>> + * That ensures that the user can turn RECLAIM_MIGRATE
>> + * on without needing to recalculate migration targets.
>> + */
>> +#if defined(CONFIG_MEMORY_HOTPLUG)
>> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>> +						 unsigned long action, void *arg)
>> +{
>> +	switch (action) {
>> +	case MEM_GOING_OFFLINE:
>> +		/*
>> +		 * Make sure there are not transient states where
>> +		 * an offline node is a migration target.  This
>> +		 * will leave migration disabled until the offline
>> +		 * completes and the MEM_OFFLINE case below runs.
>> +		 */
>> +		disable_all_migrate_targets();
>> +		break;
>> +	case MEM_OFFLINE:
>> +	case MEM_ONLINE:
>> +		/*
>> +		 * Recalculate the target nodes once the node
>> +		 * reaches its final state (online or offline).
>> +		 */
>> +		set_migration_target_nodes();
>> +		break;
>> +	case MEM_CANCEL_OFFLINE:
>> +		/*
>> +		 * MEM_GOING_OFFLINE disabled all the migration
>> +		 * targets.  Reenable them.
>> +		 */
>> +		set_migration_target_nodes();
>> +		break;
>> +	case MEM_GOING_ONLINE:
>> +	case MEM_CANCEL_ONLINE:
>> +		break;
>> +	}
>> +
>> +	return notifier_from_errno(0);
>>   }
>> +
>> +static int __init migrate_on_reclaim_init(void)
>> +{
>> +	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
>> +	return 0;
>> +}
>> +late_initcall(migrate_on_reclaim_init);
>> +#endif /* CONFIG_MEMORY_HOTPLUG */
>> +
>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
>> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>    * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>    * ABI.  New bits are OK, but existing bits can never change.
>>    */
>> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
>> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
>> +#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
>> +#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
>> +#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
>> +#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
>>   
>>   /*
>>    * Priority for NODE_RECLAIM. This determines the fraction of pages
> I found that RECLAIM_MIGRATE is defined but never referenced in the
> patch.
>
> If my understanding of the code is correct, shrink_do_demote_mapping()
> is called by shrink_page_list(), which is used by kswapd and direct
> reclaim.  So as long as the persistent memory node is onlined,
> reclaim-based migration will be enabled regardless of node reclaim mode.

It looks that way according to the code. But the intention of the new
node reclaim mode is to do migration on reclaim *only when*
RECLAIM_MIGRATE is enabled by the user.

It looks like the patch just clears the migration target node masks
when the memory is offlined.

So I suppose you need to check whether node reclaim is enabled before
doing migration in shrink_page_list(), and you also need to make node
reclaim adopt the new mode.
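
To make the idea concrete, here is a rough userspace model of the kind
of check I mean. It is only a sketch, not kernel code: the RECLAIM_*
values mirror the defines quoted above, node_demotion[] stands in for
the array from the patch description, and demotion_allowed(),
MAX_NODES and NO_TARGET are hypothetical names made up for the
illustration.

```c
/* Userspace model only; none of this is the actual kernel code. */
#include <assert.h>
#include <stdbool.h>

/* Bit values mirror the RECLAIM_* defines in the quoted mm/vmscan.c hunk. */
#define RECLAIM_RSVD	(1 << 0)
#define RECLAIM_WRITE	(1 << 1)
#define RECLAIM_UNMAP	(1 << 2)
#define RECLAIM_MIGRATE	(1 << 3)

#define MAX_NODES	4	/* hypothetical, for the model only */
#define NO_TARGET	(-1)	/* hypothetical "no demotion target" marker */

/* Stand-ins for the sysctl value and the per-node demotion target table. */
static int node_reclaim_mode;
static int node_demotion[MAX_NODES] = { 1, NO_TARGET, NO_TARGET, NO_TARGET };

/*
 * Hypothetical helper: demote a page from node @nid only if the user
 * opted in via RECLAIM_MIGRATE *and* the node has a demotion target.
 */
static bool demotion_allowed(int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);

	if (!(node_reclaim_mode & RECLAIM_MIGRATE))
		return false;

	return node_demotion[nid] != NO_TARGET;
}
```

With node_reclaim_mode left at 0 the helper refuses demotion even
though node 0 has a target, which is the behavior the new mode is
supposed to give. Setting the RECLAIM_MIGRATE bit allows demotion from
node 0 but still not from nodes without a target.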

Please refer to 
https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/

I copied the related chunks here:

+		if (is_demote_ok(page_to_nid(page))) {    <--- check if node reclaim is enabled
+			list_add(&page->lru, &demote_pages);
+			unlock_page(page);
+			continue;
+		}

and

@@ -4084,8 +4179,10 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.gfp_mask = current_gfp_context(gfp_mask),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
-		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
+		.may_writepage = !!((node_reclaim_mode & RECLAIM_WRITE) ||
+				    (node_reclaim_mode & RECLAIM_MIGRATE)),
+		.may_unmap = !!((node_reclaim_mode & RECLAIM_UNMAP) ||
+				(node_reclaim_mode & RECLAIM_MIGRATE)),
 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
 	};
@@ -4105,7 +4202,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
+	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
+	    (node_reclaim_mode & RECLAIM_MIGRATE)) {
 		/*
 		 * Free memory by calling shrink node with increasing
 		 * priorities until we have enough memory freed.
@@ -4138,9 +4236,12 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	 * thrown out if the node is overallocated. So we do not reclaim
 	 * if less than a specified percentage of the node is used by
 	 * unmapped file backed pages.
+	 *
+	 * Migrate mode doesn't care the above restrictions.
 	 */
 	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
-	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages &&
+	    !(node_reclaim_mode & RECLAIM_MIGRATE))
 		return NODE_RECLAIM_FULL;
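
The effect of those chunks on the node reclaim path can be modeled in
userspace like this. Again, it is only a sketch under assumptions:
struct scan_control_model, sc_from_mode(), node_reclaim_skipped() and
example_checks() are hypothetical names for the illustration, and only
the two scan_control fields touched above are modeled.

```c
/* Userspace model only; it mirrors the logic of the chunks above. */
#include <assert.h>
#include <stdbool.h>

#define RECLAIM_WRITE	(1 << 1)
#define RECLAIM_UNMAP	(1 << 2)
#define RECLAIM_MIGRATE	(1 << 3)

/* Hypothetical cut-down scan_control: just the two affected fields. */
struct scan_control_model {
	bool may_writepage;
	bool may_unmap;
};

/* RECLAIM_MIGRATE implies both writepage and unmap are permitted. */
static struct scan_control_model sc_from_mode(int mode)
{
	struct scan_control_model sc = {
		.may_writepage = !!(mode & (RECLAIM_WRITE | RECLAIM_MIGRATE)),
		.may_unmap = !!(mode & (RECLAIM_UNMAP | RECLAIM_MIGRATE)),
	};

	return sc;
}

/*
 * Models the early NODE_RECLAIM_FULL return in node_reclaim(): the
 * pagecache/slab thresholds only short-circuit reclaim when migrate
 * mode is off.
 */
static bool node_reclaim_skipped(int mode,
				 unsigned long pagecache,
				 unsigned long min_unmapped,
				 unsigned long slab,
				 unsigned long min_slab)
{
	return pagecache <= min_unmapped && slab <= min_slab &&
	       !(mode & RECLAIM_MIGRATE);
}

/* A few sanity checks on the model. */
static void example_checks(void)
{
	assert(sc_from_mode(RECLAIM_MIGRATE).may_writepage);
	assert(sc_from_mode(RECLAIM_MIGRATE).may_unmap);
	assert(!sc_from_mode(RECLAIM_WRITE).may_unmap);
	assert(node_reclaim_skipped(0, 0, 10, 0, 10));
	assert(!node_reclaim_skipped(RECLAIM_MIGRATE, 0, 10, 0, 10));
}
```

The point of the last check is that a node below both thresholds would
normally return NODE_RECLAIM_FULL, but with RECLAIM_MIGRATE set the
reclaim pass still runs so that pages can be demoted.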

>
> Best Regards,
> Huang, Ying

