From: "Huang, Ying" <ying.huang@intel.com>
To: Zi Yan <ziy@nvidia.com>, Dave Hansen <dave.hansen@linux.intel.com>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	Yang Shi <shy828301@gmail.com>, Michal Hocko <mhocko@suse.com>,
	Wei Xu <weixugc@google.com>, David Rientjes <rientjes@google.com>,
	Dan Williams <dan.j.williams@intel.com>,
	"David Hildenbrand" <david@redhat.com>,
	osalvador <osalvador@suse.de>
Subject: Re: [PATCH -V8 02/10] mm/numa: automatically generate node migration order
Date: Sat, 19 Jun 2021 16:18:55 +0800
Message-ID: <87v96anu6o.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <79397FE3-4B08-4DE5-8468-C5CAE36A3E39@nvidia.com> (Zi Yan's message of "Fri, 18 Jun 2021 11:14:16 -0400")

Zi Yan <ziy@nvidia.com> writes:

> On 18 Jun 2021, at 2:15, Huang Ying wrote:
>
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> When memory fills up on a node, memory contents can be
>> automatically migrated to another node.  The biggest problems are
>> knowing when to migrate and to where the migration should be
>> targeted.
>>
>> The most straightforward way to generate the "to where" list would
>> be to follow the page allocator fallback lists.  Those lists
>> already tell us if memory is full where to look next.  It would
>> also be logical to move memory in that order.
>>
>> But, the allocator fallback lists have a fatal flaw: most nodes
>> appear in all the lists.  This would potentially lead to migration
>> cycles (A->B, B->A, A->B, ...).
>>
>> Instead of using the allocator fallback lists directly, keep a
>> separate node migration ordering.  But, reuse the same data used
>> to generate page allocator fallback in the first place:
>> find_next_best_node().
>>
>> This means that the firmware data used to populate node distances
>> essentially dictates the ordering for now.  It should also be
>> architecture-neutral since all NUMA architectures have a working
>> find_next_best_node().
>>
>> The protocol for node_demotion[] access and writing is not
>> standard.  It has no specific locking and is intended to be read
>> locklessly.  Readers must take care to avoid observing changes
>> that appear incoherent.  This was done so that node_demotion[]
>> locking has no chance of becoming a bottleneck on large systems
>> with lots of CPUs in direct reclaim.
>>
>> This code is unused for now.  It will be called later in the
>> series.
>>
>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Reviewed-by: Yang Shi <shy828301@gmail.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Wei Xu <weixugc@google.com>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: osalvador <osalvador@suse.de>
>>
>> --
>>
>> Changes from 20200122:
>>  * Add big node_demotion[] comment
>> Changes from 20210302:
>>  * Fix typo in node_demotion[] comment
>> ---
>>  mm/internal.h   |   5 ++
>>  mm/migrate.c    | 175 +++++++++++++++++++++++++++++++++++++++++++++++-
>>  mm/page_alloc.c |   2 +-
>>  3 files changed, 180 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 2f1182948aa6..0344cd78e170 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -522,12 +522,17 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
>>
>>  #ifdef CONFIG_NUMA
>>  extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
>> +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
>>  #else
>>  static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
>>  				unsigned int order)
>>  {
>>  	return NODE_RECLAIM_NOSCAN;
>>  }
>> +static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
>> +{
>> +	return NUMA_NO_NODE;
>> +}
>>  #endif
>>
>>  extern int hwpoison_filter(struct page *p);
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 6cab668132f9..111f8565f75d 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1136,6 +1136,44 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>>  	return rc;
>>  }
>>
>> +
>> +/*
>> + * node_demotion[] example:
>> + *
>> + * Consider a system with two sockets.  Each socket has
>> + * three classes of memory attached: fast, medium and slow.
>> + * Each memory class is placed in its own NUMA node.  The
>> + * CPUs are placed in the node with the "fast" memory.  The
>> + * 6 NUMA nodes (0-5) might be split among the sockets like
>> + * this:
>> + *
>> + *	Socket A: 0, 1, 2
>> + *	Socket B: 3, 4, 5
>> + *
>> + * When Node 0 fills up, its memory should be migrated to
>> + * Node 1.  When Node 1 fills up, it should be migrated to
>> + * Node 2.  The migration path starts on the nodes with the
>> + * processors (since allocations default to this node) and
>> + * fast memory, progress through medium and end with the
>> + * slow memory:
>> + *
>> + *	0 -> 1 -> 2 -> stop
>> + *	3 -> 4 -> 5 -> stop
>> + *
>> + * This is represented in the node_demotion[] like this:
>> + *
>> + *	{  1, // Node 0 migrates to 1
>> + *	   2, // Node 1 migrates to 2
>> + *	  -1, // Node 2 does not migrate
>> + *	   4, // Node 3 migrates to 4
>> + *	   5, // Node 4 migrates to 5
>> + *	  -1} // Node 5 does not migrate
>> + */
>> +
>> +/*
>> + * Writes to this array occur without locking.  READ_ONCE()
>> + * is recommended for readers to ensure consistent reads.
>> + */
>>  static int node_demotion[MAX_NUMNODES] __read_mostly =
>>  	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
>>
>> @@ -1150,7 +1188,13 @@ static int node_demotion[MAX_NUMNODES] __read_mostly =
>>   */
>>  int next_demotion_node(int node)
>>  {
>> -	return node_demotion[node];
>> +	/*
>> +	 * node_demotion[] is updated without excluding
>> +	 * this function from running.  READ_ONCE() avoids
>> +	 * reading multiple, inconsistent 'node' values
>> +	 * during an update.
>> +	 */
>> +	return READ_ONCE(node_demotion[node]);
>>  }
>
> Is it necessary to have two separate patches to add node_demotion and
> next_demotion_node() then modify it immediately? Maybe merge Patch 1 into 2?
>
> Hmm, I just checked Patch 3 and it changes node_demotion again and uses RCU.
> I guess it might be much simpler to just introduce node_demotion with RCU
> in this patch and Patch 3 only takes care of hotplug events.

Hi, Dave,

What do you think about this?

>>
>>  /*
>> @@ -3144,3 +3188,132 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
>>  }
>>  EXPORT_SYMBOL(migrate_vma_finalize);
>>  #endif /* CONFIG_DEVICE_PRIVATE */
>> +
>> +/* Disable reclaim-based migration. */
>> +static void disable_all_migrate_targets(void)
>> +{
>> +	int node;
>> +
>> +	for_each_online_node(node)
>> +		node_demotion[node] = NUMA_NO_NODE;
>> +}
>> +
>> +/*
>> + * Find an automatic demotion target for 'node'.
>> + * Failing here is OK.  It might just indicate
>> + * being at the end of a chain.
>> + */
>> +static int establish_migrate_target(int node, nodemask_t *used)
>> +{
>> +	int migration_target;
>> +
>> +	/*
>> +	 * Can not set a migration target on a
>> +	 * node with it already set.
>> +	 *
>> +	 * No need for READ_ONCE() here since this is
>> +	 * in the write path for node_demotion[].
>> +	 * This should be the only thread writing.
>> +	 */
>> +	if (node_demotion[node] != NUMA_NO_NODE)
>> +		return NUMA_NO_NODE;
>> +
>> +	migration_target = find_next_best_node(node, used);
>> +	if (migration_target == NUMA_NO_NODE)
>> +		return NUMA_NO_NODE;
>> +
>> +	node_demotion[node] = migration_target;
>> +
>> +	return migration_target;
>> +}
>> +
>> +/*
>> + * When memory fills up on a node, memory contents can be
>> + * automatically migrated to another node instead of
>> + * discarded at reclaim.
>> + *
>> + * Establish a "migration path" which will start at nodes
>> + * with CPUs and will follow the priorities used to build the
>> + * page allocator zonelists.
>> + *
>> + * The difference here is that cycles must be avoided.  If
>> + * node0 migrates to node1, then neither node1, nor anything
>> + * node1 migrates to can migrate to node0.
>> + *
>> + * This function can run simultaneously with readers of
>> + * node_demotion[].  However, it can not run simultaneously
>> + * with itself.  Exclusion is provided by memory hotplug events
>> + * being single-threaded.
>> + */
>> +static void __set_migration_target_nodes(void)
>> +{
>> +	nodemask_t next_pass	= NODE_MASK_NONE;
>> +	nodemask_t this_pass	= NODE_MASK_NONE;
>> +	nodemask_t used_targets = NODE_MASK_NONE;
>> +	int node;
>> +
>> +	/*
>> +	 * Avoid any oddities like cycles that could occur
>> +	 * from changes in the topology.  This will leave
>> +	 * a momentary gap when migration is disabled.
>> +	 */
>> +	disable_all_migrate_targets();
>> +
>> +	/*
>> +	 * Ensure that the "disable" is visible across the system.
>> +	 * Readers will see either a combination of before+disable
>> +	 * state or disable+after.  They will never see before and
>> +	 * after state together.
>> +	 *
>> +	 * The before+after state together might have cycles and
>> +	 * could cause readers to do things like loop until this
>> +	 * function finishes.  This ensures they can only see a
>> +	 * single "bad" read and would, for instance, only loop
>> +	 * once.
>> +	 */
>> +	smp_wmb();
>> +
>> +	/*
>> +	 * Allocations go close to CPUs, first.  Assume that
>> +	 * the migration path starts at the nodes with CPUs.
>> +	 */
>> +	next_pass = node_states[N_CPU];
>
> Is there a plan of allowing user to change where the migration
> path starts? Or maybe one step further providing an interface
> to allow user to specify the demotion path. Something like
> /sys/devices/system/node/node*/node_demotion.

I don't think that's necessary at least for now.  Do you know any real
world use case for this?

Best Regards,
Huang, Ying

[snip]
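
For readers following the thread, the pass-based ordering and the
lockless lookup discussed above can be tried out in userspace.  The
following is only a minimal sketch under stated assumptions, not the
code from the patch: the remainder of __set_migration_target_nodes()
is snipped from this message, so the pass loop is a guess at its
shape.  distance(), nearest_unused(), build_demotion_order() and
demotion_hops() are hypothetical names, the distance values are
invented, and relaxed C11 atomics stand in for the kernel's
READ_ONCE() and plain stores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_NODES 6     /* two sockets x three memory classes, as in the comment */
#define NO_NODE  (-1)  /* userspace stand-in for NUMA_NO_NODE */

/* Stand-in for node_demotion[]; relaxed atomics approximate lockless access. */
static _Atomic int demotion[NR_NODES];

/* Invented distances (assumption): same socket is near, other socket is far. */
static int distance(int a, int b)
{
	int step = abs(a % 3 - b % 3);

	return (a / 3 == b / 3) ? 10 + 7 * step : 40 + 7 * step;
}

/* Hypothetical simplification of find_next_best_node(): nearest unused node. */
static int nearest_unused(int node, bool used[NR_NODES])
{
	int best = NO_NODE;

	for (int n = 0; n < NR_NODES; n++) {
		if (used[n])
			continue;
		if (best == NO_NODE || distance(node, n) < distance(node, best))
			best = n;
	}
	if (best != NO_NODE)
		used[best] = true;	/* a node is a target at most once */
	return best;
}

/* Build the order in passes, starting from the nodes that have CPUs. */
static void build_demotion_order(const bool has_cpu[NR_NODES])
{
	bool used[NR_NODES] = { false };
	bool this_pass[NR_NODES], next_pass[NR_NODES];
	bool more = false;

	for (int n = 0; n < NR_NODES; n++) {
		/* Disable every target first, as the patch does. */
		atomic_store_explicit(&demotion[n], NO_NODE, memory_order_relaxed);
		next_pass[n] = has_cpu[n];
		more |= has_cpu[n];
	}

	while (more) {
		more = false;
		for (int n = 0; n < NR_NODES; n++) {
			this_pass[n] = next_pass[n];
			next_pass[n] = false;
			/* Sources of this pass can never become targets: no cycles. */
			used[n] |= this_pass[n];
		}
		for (int n = 0; n < NR_NODES; n++) {
			int target;

			if (!this_pass[n])
				continue;
			target = nearest_unused(n, used);
			if (target == NO_NODE)
				continue;	/* end of a chain */
			atomic_store_explicit(&demotion[n], target, memory_order_relaxed);
			next_pass[target] = true;	/* visit the target next pass */
			more = true;
		}
	}
}

/* Lockless reader: count hops from 'node' to the end of its chain. */
static int demotion_hops(int node)
{
	int hops = 0;

	assert(node >= 0 && node < NR_NODES);
	/* Bounded walk, so a transiently odd table cannot loop forever. */
	while (hops < NR_NODES) {
		int next = atomic_load_explicit(&demotion[node], memory_order_relaxed);

		if (next == NO_NODE)
			break;
		node = next;
		hops++;
	}
	return hops;
}
```

With CPUs on nodes 0 and 3 and the invented distances above, calling
build_demotion_order() fills the table with { 1, 2, -1, 4, 5, -1 },
the same layout as the node_demotion[] example in the patch comment,
and demotion_hops(0) returns 2.  Marking every source of a pass as
used before picking any target is what keeps the resulting chains
free of cycles in this sketch.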
