Linux-mm Archive on lore.kernel.org
* [RFC][PATCH 0/8] Migrate Pages in lieu of discard
@ 2020-06-29 23:45 Dave Hansen
  2020-06-29 23:45 ` [RFC][PATCH 1/8] mm/numa: node demotion data structure and lookup Dave Hansen
                   ` (8 more replies)
  0 siblings, 9 replies; 43+ messages in thread
From: Dave Hansen @ 2020-06-29 23:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, yang.shi, rientjes, ying.huang, dan.j.williams

I've been sitting on these for too long.  The main purpose of this
post is to have a public discussion with the other folks who are
interested in this functionality and converge on a single
implementation.

This set directly incorporates a statistics patch from Yang Shi and
also includes one to ensure good behavior with cgroup reclaim, which
was very closely derived from this series:

	https://lore.kernel.org/linux-mm/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com/

Since the last post, the major changes are:
 - Added patch to skip migration when doing cgroup reclaim
 - Added stats patch from Yang Shi

The full series is also available here:

	https://github.com/hansendc/linux/tree/automigrate-20200629

--

We're starting to see systems with more and more kinds of memory such
as Intel's implementation of persistent memory.

Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out.  Allocations will, at some point, start
spilling over to the slower persistent memory.

That has two nasty properties.  First, the newer allocations can end
up in the slower persistent memory.  Second, reclaimed data in DRAM
are just discarded even if there are gobs of space in persistent
memory that could be used.

This set implements a solution to these problems.  At the end of the
reclaim process in shrink_page_list() just before the last page
refcount is dropped, the page is migrated to persistent memory instead
of being dropped.

While I've talked about a DRAM/PMEM pairing, this approach would
function in any environment where memory tiers exist.

This is not perfect.  It "strands" pages in slower memory and never
brings them back to fast DRAM.  Other things need to be built to
promote hot pages back to DRAM.

This is part of a larger patch set.  If you want to apply these or
play with them, I'd suggest using the tree from here.  It includes
autonuma-based hot page promotion back to DRAM:

	http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com

This is also all based on an upstream mechanism that allows
persistent memory to be onlined and used as if it were volatile:

	http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>

--

Dave Hansen (5):
      mm/numa: node demotion data structure and lookup
      mm/vmscan: Attempt to migrate page in lieu of discard
      mm/numa: automatically generate node migration order
      mm/vmscan: never demote for memcg reclaim
      mm/numa: new reclaim mode to enable reclaim-based migration

Keith Busch (2):
      mm/migrate: Defer allocating new page until needed
      mm/vmscan: Consider anonymous pages without swap

Yang Shi (1):
      mm/vmscan: add page demotion counter

 Documentation/admin-guide/sysctl/vm.rst |    9
 include/linux/migrate.h                 |    6
 include/linux/node.h                    |    9
 include/linux/vm_event_item.h           |    2
 include/trace/events/migrate.h          |    3
 mm/debug.c                              |    1
 mm/internal.h                           |    1
 mm/migrate.c                            |  400 ++++++++++++++++++++++++++------
 mm/page_alloc.c                         |    2
 mm/vmscan.c                             |   88 ++++++-
 mm/vmstat.c                             |    2
 11 files changed, 439 insertions(+), 84 deletions(-)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC][PATCH 1/8] mm/numa: node demotion data structure and lookup
  2020-06-29 23:45 [RFC][PATCH 0/8] Migrate Pages in lieu of discard Dave Hansen
@ 2020-06-29 23:45 ` Dave Hansen
  2020-06-29 23:45 ` [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed Dave Hansen
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2020-06-29 23:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, yang.shi, rientjes, ying.huang, dan.j.williams


From: Dave Hansen <dave.hansen@linux.intel.com>

Prepare for the kernel to auto-migrate pages to other memory nodes
with a user-defined node migration table.  This allows creating a
single migration target for each NUMA node to enable the kernel to
do NUMA page migrations instead of simply reclaiming colder pages.
A node with no target is a "terminal node", so reclaim acts normally
there.  The migration target does not fundamentally _need_ to be a
single node, but this implementation starts there to limit
complexity.

If you consider the migration path as a graph, cycles (loops) in the
graph are disallowed.  This avoids wasting resources by constantly
migrating (A->B, B->A, A->B ...).  The expectation is that cycles will
never be allowed.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---

 b/mm/migrate.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff -puN mm/migrate.c~0006-node-Define-and-export-memory-migration-path mm/migrate.c
--- a/mm/migrate.c~0006-node-Define-and-export-memory-migration-path	2020-06-29 16:34:36.849312609 -0700
+++ b/mm/migrate.c	2020-06-29 16:34:36.853312609 -0700
@@ -1159,6 +1159,29 @@ out:
 	return rc;
 }
 
+static int node_demotion[MAX_NUMNODES] = {[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
+
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * @returns: node id for next memory node in the demotion path hierarchy
+ * from @node; -1 if @node is terminal
+ */
+int next_demotion_node(int node)
+{
+	get_online_mems();
+	while (true) {
+		node = node_demotion[node];
+		if (node == NUMA_NO_NODE)
+			break;
+		if (node_online(node))
+			break;
+	}
+	put_online_mems();
+	return node;
+}
+
 /*
  * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move().  Work
  * around it.
_



* [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed
  2020-06-29 23:45 [RFC][PATCH 0/8] Migrate Pages in lieu of discard Dave Hansen
  2020-06-29 23:45 ` [RFC][PATCH 1/8] mm/numa: node demotion data structure and lookup Dave Hansen
@ 2020-06-29 23:45 ` Dave Hansen
  2020-07-01  8:47   ` Greg Thelen
  2020-06-29 23:45 ` [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard Dave Hansen
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2020-06-29 23:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, kbusch, yang.shi, rientjes, ying.huang,
	dan.j.williams


From: Keith Busch <kbusch@kernel.org>

Migrating pages had been allocating the new page before it was
actually needed.  Subsequent operations may still fail, and their
error paths then have to clean up a newly allocated page that was
never used.

Defer allocating the page until we are actually ready to make use of
it, after locking the original page. This simplifies error handling,
but should not have any functional change in behavior. This is just
refactoring page migration so the main part can more easily be reused
by other code.

#Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---

 b/mm/migrate.c |  148 ++++++++++++++++++++++++++++-----------------------------
 1 file changed, 75 insertions(+), 73 deletions(-)

diff -puN mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed mm/migrate.c
--- a/mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed	2020-06-29 16:34:37.896312607 -0700
+++ b/mm/migrate.c	2020-06-29 16:34:37.900312607 -0700
@@ -1014,56 +1014,17 @@ out:
 	return rc;
 }
 
-static int __unmap_and_move(struct page *page, struct page *newpage,
-				int force, enum migrate_mode mode)
+static int __unmap_and_move(new_page_t get_new_page,
+			    free_page_t put_new_page,
+			    unsigned long private, struct page *page,
+			    enum migrate_mode mode,
+			    enum migrate_reason reason)
 {
 	int rc = -EAGAIN;
 	int page_was_mapped = 0;
 	struct anon_vma *anon_vma = NULL;
 	bool is_lru = !__PageMovable(page);
-
-	if (!trylock_page(page)) {
-		if (!force || mode == MIGRATE_ASYNC)
-			goto out;
-
-		/*
-		 * It's not safe for direct compaction to call lock_page.
-		 * For example, during page readahead pages are added locked
-		 * to the LRU. Later, when the IO completes the pages are
-		 * marked uptodate and unlocked. However, the queueing
-		 * could be merging multiple pages for one bio (e.g.
-		 * mpage_readpages). If an allocation happens for the
-		 * second or third page, the process can end up locking
-		 * the same page twice and deadlocking. Rather than
-		 * trying to be clever about what pages can be locked,
-		 * avoid the use of lock_page for direct compaction
-		 * altogether.
-		 */
-		if (current->flags & PF_MEMALLOC)
-			goto out;
-
-		lock_page(page);
-	}
-
-	if (PageWriteback(page)) {
-		/*
-		 * Only in the case of a full synchronous migration is it
-		 * necessary to wait for PageWriteback. In the async case,
-		 * the retry loop is too short and in the sync-light case,
-		 * the overhead of stalling is too much
-		 */
-		switch (mode) {
-		case MIGRATE_SYNC:
-		case MIGRATE_SYNC_NO_COPY:
-			break;
-		default:
-			rc = -EBUSY;
-			goto out_unlock;
-		}
-		if (!force)
-			goto out_unlock;
-		wait_on_page_writeback(page);
-	}
+	struct page *newpage;
 
 	/*
 	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
@@ -1082,6 +1043,12 @@ static int __unmap_and_move(struct page
 	if (PageAnon(page) && !PageKsm(page))
 		anon_vma = page_get_anon_vma(page);
 
+	newpage = get_new_page(page, private);
+	if (!newpage) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
 	/*
 	 * Block others from accessing the new page when we get around to
 	 * establishing additional references. We are usually the only one
@@ -1091,11 +1058,11 @@ static int __unmap_and_move(struct page
 	 * This is much like races on refcount of oldpage: just don't BUG().
 	 */
 	if (unlikely(!trylock_page(newpage)))
-		goto out_unlock;
+		goto out_put;
 
 	if (unlikely(!is_lru)) {
 		rc = move_to_new_page(newpage, page, mode);
-		goto out_unlock_both;
+		goto out_unlock;
 	}
 
 	/*
@@ -1114,7 +1081,7 @@ static int __unmap_and_move(struct page
 		VM_BUG_ON_PAGE(PageAnon(page), page);
 		if (page_has_private(page)) {
 			try_to_free_buffers(page);
-			goto out_unlock_both;
+			goto out_unlock;
 		}
 	} else if (page_mapped(page)) {
 		/* Establish migration ptes */
@@ -1131,15 +1098,9 @@ static int __unmap_and_move(struct page
 	if (page_was_mapped)
 		remove_migration_ptes(page,
 			rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);
-
-out_unlock_both:
-	unlock_page(newpage);
 out_unlock:
-	/* Drop an anon_vma reference if we took one */
-	if (anon_vma)
-		put_anon_vma(anon_vma);
-	unlock_page(page);
-out:
+	unlock_page(newpage);
+out_put:
 	/*
 	 * If migration is successful, decrease refcount of the newpage
 	 * which will not free the page because new page owner increased
@@ -1150,12 +1111,20 @@ out:
 	 * state.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
+		set_page_owner_migrate_reason(newpage, reason);
 		if (unlikely(!is_lru))
 			put_page(newpage);
 		else
 			putback_lru_page(newpage);
+	} else if (put_new_page) {
+		put_new_page(newpage, private);
+	} else {
+		put_page(newpage);
 	}
-
+out:
+	/* Drop an anon_vma reference if we took one */
+	if (anon_vma)
+		put_anon_vma(anon_vma);
 	return rc;
 }
 
@@ -1203,8 +1172,7 @@ static ICE_noinline int unmap_and_move(n
 				   int force, enum migrate_mode mode,
 				   enum migrate_reason reason)
 {
-	int rc = MIGRATEPAGE_SUCCESS;
-	struct page *newpage = NULL;
+	int rc = -EAGAIN;
 
 	if (!thp_migration_supported() && PageTransHuge(page))
 		return -ENOMEM;
@@ -1219,17 +1187,57 @@ static ICE_noinline int unmap_and_move(n
 				__ClearPageIsolated(page);
 			unlock_page(page);
 		}
+		rc = MIGRATEPAGE_SUCCESS;
 		goto out;
 	}
 
-	newpage = get_new_page(page, private);
-	if (!newpage)
-		return -ENOMEM;
+	if (!trylock_page(page)) {
+		if (!force || mode == MIGRATE_ASYNC)
+			return rc;
 
-	rc = __unmap_and_move(page, newpage, force, mode);
-	if (rc == MIGRATEPAGE_SUCCESS)
-		set_page_owner_migrate_reason(newpage, reason);
+		/*
+		 * It's not safe for direct compaction to call lock_page.
+		 * For example, during page readahead pages are added locked
+		 * to the LRU. Later, when the IO completes the pages are
+		 * marked uptodate and unlocked. However, the queueing
+		 * could be merging multiple pages for one bio (e.g.
+		 * mpage_readpages). If an allocation happens for the
+		 * second or third page, the process can end up locking
+		 * the same page twice and deadlocking. Rather than
+		 * trying to be clever about what pages can be locked,
+		 * avoid the use of lock_page for direct compaction
+		 * altogether.
+		 */
+		if (current->flags & PF_MEMALLOC)
+			return rc;
+
+		lock_page(page);
+	}
+
+	if (PageWriteback(page)) {
+		/*
+		 * Only in the case of a full synchronous migration is it
+		 * necessary to wait for PageWriteback. In the async case,
+		 * the retry loop is too short and in the sync-light case,
+		 * the overhead of stalling is too much
+		 */
+		switch (mode) {
+		case MIGRATE_SYNC:
+		case MIGRATE_SYNC_NO_COPY:
+			break;
+		default:
+			rc = -EBUSY;
+			goto out_unlock;
+		}
+		if (!force)
+			goto out_unlock;
+		wait_on_page_writeback(page);
+	}
+	rc = __unmap_and_move(get_new_page, put_new_page, private,
+			      page, mode, reason);
 
+out_unlock:
+	unlock_page(page);
 out:
 	if (rc != -EAGAIN) {
 		/*
@@ -1269,9 +1277,8 @@ out:
 		if (rc != -EAGAIN) {
 			if (likely(!__PageMovable(page))) {
 				putback_lru_page(page);
-				goto put_new;
+				goto done;
 			}
-
 			lock_page(page);
 			if (PageMovable(page))
 				putback_movable_page(page);
@@ -1280,13 +1287,8 @@ out:
 			unlock_page(page);
 			put_page(page);
 		}
-put_new:
-		if (put_new_page)
-			put_new_page(newpage, private);
-		else
-			put_page(newpage);
 	}
-
+done:
 	return rc;
 }
 
_



* [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-06-29 23:45 [RFC][PATCH 0/8] Migrate Pages in lieu of discard Dave Hansen
  2020-06-29 23:45 ` [RFC][PATCH 1/8] mm/numa: node demotion data structure and lookup Dave Hansen
  2020-06-29 23:45 ` [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed Dave Hansen
@ 2020-06-29 23:45 ` Dave Hansen
  2020-07-01  0:47   ` David Rientjes
  2020-06-29 23:45 ` [RFC][PATCH 4/8] mm/vmscan: add page demotion counter Dave Hansen
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2020-06-29 23:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, kbusch, yang.shi, rientjes, ying.huang,
	dan.j.williams


From: Dave Hansen <dave.hansen@linux.intel.com>

If a memory node has a preferred migration path to demote cold pages,
attempt to move those inactive pages to that migration node before
reclaiming. This will better utilize available memory, provide a faster
tier than swapping or discarding, and allow such pages to be reused
immediately without IO to retrieve the data.

When handling anonymous pages, this will be considered before swap if
enabled. Should the demotion fail for any reason, the page reclaim
will proceed as if the demotion feature was not enabled.

Some places we would like to see this used:

  1. Persistent memory being used as a slower, cheaper DRAM replacement
  2. Remote memory-only "expansion" NUMA nodes
  3. Resolving memory imbalances where one NUMA node is seeing more
     allocation activity than another.  This helps keep more recent
     allocations closer to the CPUs on the node doing the allocating.

Yang Shi's patches used an alternative approach where to-be-discarded
pages were collected on a separate discard list and then discarded
as a batch with migrate_pages().  This results in simpler code and
has all the performance advantages of batching, but has the
disadvantage that pages which fail to migrate never get swapped.

#Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---

 b/include/linux/migrate.h        |    6 ++++
 b/include/trace/events/migrate.h |    3 +-
 b/mm/debug.c                     |    1 
 b/mm/migrate.c                   |   52 +++++++++++++++++++++++++++++++++++++++
 b/mm/vmscan.c                    |   25 ++++++++++++++++++
 5 files changed, 86 insertions(+), 1 deletion(-)

diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
--- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.950312604 -0700
+++ b/include/linux/migrate.h	2020-06-29 16:34:38.963312604 -0700
@@ -25,6 +25,7 @@ enum migrate_reason {
 	MR_MEMPOLICY_MBIND,
 	MR_NUMA_MISPLACED,
 	MR_CONTIG_RANGE,
+	MR_DEMOTION,
 	MR_TYPES
 };
 
@@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin
 				  struct page *newpage, struct page *page);
 extern int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page, int extra_count);
+extern int migrate_demote_mapping(struct page *page);
 #else
 
 static inline void putback_movable_pages(struct list_head *l) {}
@@ -104,6 +106,10 @@ static inline int migrate_huge_page_move
 	return -ENOSYS;
 }
 
+static inline int migrate_demote_mapping(struct page *page)
+{
+	return -ENOSYS;
+}
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_COMPACTION
diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
--- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.952312604 -0700
+++ b/include/trace/events/migrate.h	2020-06-29 16:34:38.963312604 -0700
@@ -20,7 +20,8 @@
 	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
 	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
 	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
-	EMe(MR_CONTIG_RANGE,	"contig_range")
+	EM( MR_CONTIG_RANGE,	"contig_range")			\
+	EMe(MR_DEMOTION,	"demotion")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
--- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.954312604 -0700
+++ b/mm/debug.c	2020-06-29 16:34:38.963312604 -0700
@@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE
 	"mempolicy_mbind",
 	"numa_misplaced",
 	"cma",
+	"demotion",
 };
 
 const struct trace_print_flags pageflag_names[] = {
diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
--- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.956312604 -0700
+++ b/mm/migrate.c	2020-06-29 16:34:38.964312604 -0700
@@ -1151,6 +1151,58 @@ int next_demotion_node(int node)
 	return node;
 }
 
+static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
+{
+	/*
+	 * 'mask' targets allocation only to the desired node in the
+	 * migration path, and fails fast if the allocation can not be
+	 * immediately satisfied.  Reclaim is already active and heroic
+	 * allocation efforts are unwanted.
+	 */
+	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
+			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
+			__GFP_MOVABLE;
+	struct page *newpage;
+
+	if (PageTransHuge(page)) {
+		mask |= __GFP_COMP;
+		newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
+		if (newpage)
+			prep_transhuge_page(newpage);
+	} else
+		newpage = alloc_pages_node(node, mask, 0);
+
+	return newpage;
+}
+
+/**
+ * migrate_demote_mapping() - Migrate this page and its mappings to its
+ *                            demotion node.
+ * @page: A locked, isolated, non-huge page that should migrate to its current
+ *        node's demotion target, if available. Since this is intended to be
+ *        called during memory reclaim, all flag options are set to fail fast.
+ *
+ * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise.
+ */
+int migrate_demote_mapping(struct page *page)
+{
+	int next_nid = next_demotion_node(page_to_nid(page));
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(PageHuge(page), page);
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+
+	if (next_nid == NUMA_NO_NODE)
+		return -ENOSYS;
+	if (PageTransHuge(page) && !thp_migration_supported())
+		return -ENOMEM;
+
+	/* MIGRATE_ASYNC is the most light weight and never blocks.*/
+	return __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
+				page, MIGRATE_ASYNC, MR_DEMOTION);
+}
+
+
 /*
  * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move().  Work
  * around it.
diff -puN mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c
--- a/mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.959312604 -0700
+++ b/mm/vmscan.c	2020-06-29 16:34:38.965312604 -0700
@@ -1077,6 +1077,7 @@ static unsigned long shrink_page_list(st
 	LIST_HEAD(free_pages);
 	unsigned nr_reclaimed = 0;
 	unsigned pgactivate = 0;
+	int rc;
 
 	memset(stat, 0, sizeof(*stat));
 	cond_resched();
@@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
 			; /* try to reclaim the page below */
 		}
 
+		rc = migrate_demote_mapping(page);
+		/*
+		 * -ENOMEM on a THP may indicate either migration is
+		 * unsupported or there was not enough contiguous
+		 * space. Split the THP into base pages and retry the
+		 * head immediately. The tail pages will be considered
+		 * individually within the current loop's page list.
+		 */
+		if (rc == -ENOMEM && PageTransHuge(page) &&
+		    !split_huge_page_to_list(page, page_list))
+			rc = migrate_demote_mapping(page);
+
+		if (rc == MIGRATEPAGE_SUCCESS) {
+			unlock_page(page);
+			if (likely(put_page_testzero(page)))
+				goto free_it;
+			/*
+			 * Speculative reference will free this page,
+			 * so leave it off the LRU.
+			 */
+			nr_reclaimed++;
+			continue;
+		}
+
 		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
_



* [RFC][PATCH 4/8] mm/vmscan: add page demotion counter
  2020-06-29 23:45 [RFC][PATCH 0/8] Migrate Pages in lieu of discard Dave Hansen
                   ` (2 preceding siblings ...)
  2020-06-29 23:45 ` [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard Dave Hansen
@ 2020-06-29 23:45 ` Dave Hansen
  2020-06-29 23:45 ` [RFC][PATCH 5/8] mm/numa: automatically generate node migration order Dave Hansen
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2020-06-29 23:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, yang.shi, rientjes, ying.huang, dan.j.williams


From: Yang Shi <yang.shi@linux.alibaba.com>

Account the number of demoted pages into reclaim_state->nr_demoted.

Add pgdemote_kswapd and pgdemote_direct VM counters, shown in
/proc/vmstat.

[ daveh:
   - reworked the __count_vm_events() calls a bit, and made them
     look at the THP size directly rather than getting data from
     migrate_pages()
]

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---

 b/include/linux/vm_event_item.h |    2 ++
 b/mm/migrate.c                  |   13 ++++++++++++-
 b/mm/vmscan.c                   |    1 +
 b/mm/vmstat.c                   |    2 ++
 4 files changed, 17 insertions(+), 1 deletion(-)

diff -puN include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter include/linux/vm_event_item.h
--- a/include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter	2020-06-29 16:34:40.332312601 -0700
+++ b/include/linux/vm_event_item.h	2020-06-29 16:34:40.342312601 -0700
@@ -32,6 +32,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		PGREFILL,
 		PGSTEAL_KSWAPD,
 		PGSTEAL_DIRECT,
+		PGDEMOTE_KSWAPD,
+		PGDEMOTE_DIRECT,
 		PGSCAN_KSWAPD,
 		PGSCAN_DIRECT,
 		PGSCAN_DIRECT_THROTTLE,
diff -puN mm/migrate.c~mm-vmscan-add-page-demotion-counter mm/migrate.c
--- a/mm/migrate.c~mm-vmscan-add-page-demotion-counter	2020-06-29 16:34:40.334312601 -0700
+++ b/mm/migrate.c	2020-06-29 16:34:40.343312601 -0700
@@ -1187,6 +1187,7 @@ static struct page *alloc_demote_node_pa
 int migrate_demote_mapping(struct page *page)
 {
 	int next_nid = next_demotion_node(page_to_nid(page));
+	int ret;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageHuge(page), page);
@@ -1198,8 +1199,18 @@ int migrate_demote_mapping(struct page *
 		return -ENOMEM;
 
 	/* MIGRATE_ASYNC is the most light weight and never blocks.*/
-	return __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
+	ret = __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
 				page, MIGRATE_ASYNC, MR_DEMOTION);
+
+	if (ret == MIGRATEPAGE_SUCCESS) {
+		int nr_demoted = hpage_nr_pages(page);
+		if (current_is_kswapd())
+			__count_vm_events(PGDEMOTE_KSWAPD, nr_demoted);
+		else
+			__count_vm_events(PGDEMOTE_DIRECT, nr_demoted);
+	}
+
+	return ret;
 }
 
 
diff -puN mm/vmscan.c~mm-vmscan-add-page-demotion-counter mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-add-page-demotion-counter	2020-06-29 16:34:40.336312601 -0700
+++ b/mm/vmscan.c	2020-06-29 16:34:40.344312601 -0700
@@ -140,6 +140,7 @@ struct scan_control {
 		unsigned int immediate;
 		unsigned int file_taken;
 		unsigned int taken;
+		unsigned int demoted;
 	} nr;
 
 	/* for recording the reclaimed slab by now */
diff -puN mm/vmstat.c~mm-vmscan-add-page-demotion-counter mm/vmstat.c
--- a/mm/vmstat.c~mm-vmscan-add-page-demotion-counter	2020-06-29 16:34:40.339312601 -0700
+++ b/mm/vmstat.c	2020-06-29 16:34:40.345312601 -0700
@@ -1198,6 +1198,8 @@ const char * const vmstat_text[] = {
 	"pgrefill",
 	"pgsteal_kswapd",
 	"pgsteal_direct",
+	"pgdemote_kswapd",
+	"pgdemote_direct",
 	"pgscan_kswapd",
 	"pgscan_direct",
 	"pgscan_direct_throttle",
_



* [RFC][PATCH 5/8] mm/numa: automatically generate node migration order
  2020-06-29 23:45 [RFC][PATCH 0/8] Migrate Pages in lieu of discard Dave Hansen
                   ` (3 preceding siblings ...)
  2020-06-29 23:45 ` [RFC][PATCH 4/8] mm/vmscan: add page demotion counter Dave Hansen
@ 2020-06-29 23:45 ` Dave Hansen
  2020-06-30  8:22   ` Huang, Ying
  2020-06-29 23:45 ` [RFC][PATCH 6/8] mm/vmscan: Consider anonymous pages without swap Dave Hansen
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2020-06-29 23:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, yang.shi, rientjes, ying.huang, dan.j.williams


From: Dave Hansen <dave.hansen@linux.intel.com>

When memory fills up on a node, memory contents can be
automatically migrated to another node.  The biggest problems are
knowing when to migrate and to where the migration should be
targeted.

The most straightforward way to generate the "to where" list
would be to follow the page allocator fallback lists.  Those
lists already tell us, when memory is full, where to look next.
It would also be logical to move memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes
appear in all the lists.  This would potentially lead to
migration cycles (A->B, B->A, A->B, ...).

Instead of using the allocator fallback lists directly, keep a
separate node migration ordering.  But, reuse the same data used
to generate page allocator fallback in the first place:
find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now.  It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().

The protocol for node_demotion[] access and writing is not
standard.  It has no specific locking and is intended to be read
locklessly.  Readers must take care to avoid observing changes
that appear incoherent.  This was done so that node_demotion[]
locking has no chance of becoming a bottleneck on large systems
with lots of CPUs in direct reclaim.

This code is unused for now.  It will be called later in the
series.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---

 b/mm/internal.h   |    1 
 b/mm/migrate.c    |  130 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 b/mm/page_alloc.c |    2 
 3 files changed, 131 insertions(+), 2 deletions(-)

diff -puN mm/internal.h~auto-setup-default-migration-path-from-firmware mm/internal.h
--- a/mm/internal.h~auto-setup-default-migration-path-from-firmware	2020-06-29 16:34:41.629312597 -0700
+++ b/mm/internal.h	2020-06-29 16:34:41.638312597 -0700
@@ -192,6 +192,7 @@ extern int user_min_free_kbytes;
 
 extern void zone_pcp_update(struct zone *zone);
 extern void zone_pcp_reset(struct zone *zone);
+extern int find_next_best_node(int node, nodemask_t *used_node_mask);
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
diff -puN mm/migrate.c~auto-setup-default-migration-path-from-firmware mm/migrate.c
--- a/mm/migrate.c~auto-setup-default-migration-path-from-firmware	2020-06-29 16:34:41.631312597 -0700
+++ b/mm/migrate.c	2020-06-29 16:34:41.639312597 -0700
@@ -1128,6 +1128,10 @@ out:
 	return rc;
 }
 
+/*
+ * Writes to this array occur without locking.  READ_ONCE()
+ * is recommended for readers.
+ */
 static int node_demotion[MAX_NUMNODES] = {[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
 
 /**
@@ -1141,7 +1145,13 @@ int next_demotion_node(int node)
 {
 	get_online_mems();
 	while (true) {
-		node = node_demotion[node];
+		/*
+		 * node_demotion[] is updated without excluding
+		 * this function from running.  READ_ONCE() avoids
+		 * 'node' checks reading different values from
+		 * node_demotion[].
+		 */
+		node = READ_ONCE(node_demotion[node]);
 		if (node == NUMA_NO_NODE)
 			break;
 		if (node_online(node))
@@ -3086,3 +3096,121 @@ void migrate_vma_finalize(struct migrate
 }
 EXPORT_SYMBOL(migrate_vma_finalize);
 #endif /* CONFIG_DEVICE_PRIVATE */
+
+/* Disable reclaim-based migration. */
+static void disable_all_migrate_targets(void)
+{
+	int node;
+
+	for_each_online_node(node)
+		node_demotion[node] = NUMA_NO_NODE;
+}
+
+/*
+ * Find an automatic demotion target for 'node'.
+ * Failing here is OK.  It might just indicate
+ * being at the end of a chain.
+ */
+static int establish_migrate_target(int node, nodemask_t *used)
+{
+	int migration_target;
+
+	/*
+	 * Can not set a migration target on a
+	 * node with it already set.
+	 *
+	 * No need for READ_ONCE() here since this
+	 * is in the write path for node_demotion[].
+	 * This should be the only thread writing.
+	 */
+	if (node_demotion[node] != NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	migration_target = find_next_best_node(node, used);
+	if (migration_target == NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	node_demotion[node] = migration_target;
+
+	return migration_target;
+}
+
+/*
+ * When memory fills up on a node, memory contents can be
+ * automatically migrated to another node instead of
+ * discarded at reclaim.
+ *
+ * Establish a "migration path" which will start at nodes
+ * with CPUs and will follow the priorities used to build the
+ * page allocator zonelists.
+ *
+ * The difference here is that cycles must be avoided.  If
+ * node0 migrates to node1, then neither node1, nor anything
+ * node1 migrates to can migrate to node0.
+ *
+ * This function can run simultaneously with readers of
+ * node_demotion[].  However, it can not run simultaneously
+ * with itself.  Exclusion is provided by memory hotplug events
+ * being single-threaded.
+ */
+void set_migration_target_nodes(void)
+{
+	nodemask_t next_pass = NODE_MASK_NONE;
+	nodemask_t this_pass = NODE_MASK_NONE;
+	nodemask_t used_targets = NODE_MASK_NONE;
+	int node;
+
+	get_online_mems();
+	/*
+	 * Avoid any oddities like cycles that could occur
+	 * from changes in the topology.  This will leave
+	 * a momentary gap when migration is disabled.
+	 */
+	disable_all_migrate_targets();
+
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 *
+	 * The before+after state together might have cycles and
+	 * could cause readers to do things like loop until this
+	 * function finishes.  This ensures they can only see a
+	 * single "bad" read and would, for instance, only loop
+	 * once.
+	 */
+	smp_wmb();
+
+	/*
+	 * Allocations go close to CPUs, first.  Assume that
+	 * the migration path starts at the nodes with CPUs.
+	 */
+	next_pass = node_states[N_CPU];
+again:
+	this_pass = next_pass;
+	next_pass = NODE_MASK_NONE;
+	/*
+	 * To avoid cycles in the migration "graph", ensure
+	 * that migration sources are not future targets by
+	 * setting them in 'used_targets'.
+	 *
+	 * But, do this only once per pass so that multiple
+	 * source nodes can share a target node.
+	 */
+	nodes_or(used_targets, used_targets, this_pass);
+	for_each_node_mask(node, this_pass) {
+		int target_node = establish_migrate_target(node, &used_targets);
+
+		if (target_node == NUMA_NO_NODE)
+			continue;
+
+		/* Visit targets from this pass in the next pass: */
+		node_set(target_node, next_pass);
+	}
+	/* Is another pass necessary? */
+	if (!nodes_empty(next_pass))
+		goto again;
+
+	put_online_mems();
+}
diff -puN mm/page_alloc.c~auto-setup-default-migration-path-from-firmware mm/page_alloc.c
--- a/mm/page_alloc.c~auto-setup-default-migration-path-from-firmware	2020-06-29 16:34:41.634312597 -0700
+++ b/mm/page_alloc.c	2020-06-29 16:34:41.641312597 -0700
@@ -5591,7 +5591,7 @@ static int node_load[MAX_NUMNODES];
  *
  * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
  */
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+int find_next_best_node(int node, nodemask_t *used_node_mask)
 {
 	int n, val;
 	int min_val = INT_MAX;
_


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC][PATCH 6/8] mm/vmscan: Consider anonymous pages without swap
  2020-06-29 23:45 [RFC][PATCH 0/8] Migrate Pages in lieu of discard Dave Hansen
                   ` (4 preceding siblings ...)
  2020-06-29 23:45 ` [RFC][PATCH 5/8] mm/numa: automatically generate node migration order Dave Hansen
@ 2020-06-29 23:45 ` Dave Hansen
  2020-06-29 23:45 ` [RFC][PATCH 7/8] mm/vmscan: never demote for memcg reclaim Dave Hansen
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2020-06-29 23:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, kbusch, vishal.l.verma, yang.shi,
	rientjes, ying.huang, dan.j.williams


From: Keith Busch <keith.busch@intel.com>

Age and reclaim anonymous pages if a migration path is available. The
node has other recourse for inactive anonymous pages beyond swap.

#Signed-off-by: Keith Busch <keith.busch@intel.com>
Cc: Keith Busch <kbusch@kernel.org>
[vishal: fixup the migration->demotion rename]
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>

--

Changes from Dave 06/2020:
 * rename reclaim_anon_pages()->can_reclaim_anon_pages()

---

 b/include/linux/node.h |    9 +++++++++
 b/mm/vmscan.c          |   32 +++++++++++++++++++++++++++-----
 2 files changed, 36 insertions(+), 5 deletions(-)

diff -puN include/linux/node.h~0009-mm-vmscan-Consider-anonymous-pages-without-swap include/linux/node.h
--- a/include/linux/node.h~0009-mm-vmscan-Consider-anonymous-pages-without-swap	2020-06-29 16:34:42.861312594 -0700
+++ b/include/linux/node.h	2020-06-29 16:34:42.867312594 -0700
@@ -180,4 +180,13 @@ static inline void register_hugetlbfs_wi
 
 #define to_node(device) container_of(device, struct node, dev)
 
+#ifdef CONFIG_MIGRATION
+extern int next_demotion_node(int node);
+#else
+static inline int next_demotion_node(int node)
+{
+	return NUMA_NO_NODE;
+}
+#endif
+
 #endif /* _LINUX_NODE_H_ */
diff -puN mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap mm/vmscan.c
--- a/mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap	2020-06-29 16:34:42.863312594 -0700
+++ b/mm/vmscan.c	2020-06-29 16:34:42.868312594 -0700
@@ -288,6 +288,26 @@ static bool writeback_throttling_sane(st
 }
 #endif
 
+static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
+					  int node_id)
+{
+	/* Always age anon pages when we have swap */
+	if (memcg == NULL) {
+		if (get_nr_swap_pages() > 0)
+			return true;
+	} else {
+		if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
+			return true;
+	}
+
+	/* Also age anon pages if we can auto-migrate them */
+	if (next_demotion_node(node_id) >= 0)
+		return true;
+
+	/* No way to reclaim anon pages */
+	return false;
+}
+
 /*
  * This misses isolated pages which are not accounted for to save counters.
  * As the data only determines if reclaim or compaction continues, it is
@@ -299,7 +319,7 @@ unsigned long zone_reclaimable_pages(str
 
 	nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
 		zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
+	if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
 		nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
 			zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
 
@@ -2267,7 +2287,7 @@ static void get_scan_count(struct lruvec
 	enum lru_list lru;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
-	if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
+	if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2572,7 +2592,9 @@ static void shrink_lruvec(struct lruvec
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
+	if (can_reclaim_anon_pages(lruvec_memcg(lruvec),
+			       lruvec_pgdat(lruvec)->node_id) &&
+	    inactive_is_low(lruvec, LRU_INACTIVE_ANON))
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
 }
@@ -2642,7 +2664,7 @@ static inline bool should_continue_recla
 	 */
 	pages_for_compaction = compact_gap(sc->order);
 	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
+	if (can_reclaim_anon_pages(NULL, pgdat->node_id))
 		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 
 	return inactive_lru_pages > pages_for_compaction;
@@ -3395,7 +3417,7 @@ static void age_active_anon(struct pglis
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
-	if (!total_swap_pages)
+	if (!can_reclaim_anon_pages(NULL, pgdat->node_id))
 		return;
 
 	lruvec = mem_cgroup_lruvec(NULL, pgdat);
_


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC][PATCH 7/8] mm/vmscan: never demote for memcg reclaim
  2020-06-29 23:45 [RFC][PATCH 0/8] Migrate Pages in lieu of discard Dave Hansen
                   ` (5 preceding siblings ...)
  2020-06-29 23:45 ` [RFC][PATCH 6/8] mm/vmscan: Consider anonymous pages without swap Dave Hansen
@ 2020-06-29 23:45 ` Dave Hansen
  2020-06-29 23:45 ` [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration Dave Hansen
  2020-06-30 18:36 ` [RFC][PATCH 0/8] Migrate Pages in lieu of discard Shakeel Butt
  8 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2020-06-29 23:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, yang.shi, rientjes, ying.huang, dan.j.williams


From: Dave Hansen <dave.hansen@linux.intel.com>

Global reclaim aims to reduce the amount of memory used on
a given node or set of nodes.  Migrating pages to another
node serves this purpose.

memcg reclaim is different.  Its goal is to reduce the
total memory consumption of the entire memcg, across all
nodes.  Migration does not assist memcg reclaim because
it just moves page contents between nodes rather than
actually reducing memory consumption.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---

 b/mm/vmscan.c |   61 +++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 42 insertions(+), 19 deletions(-)

diff -puN mm/vmscan.c~never-demote-for-memcg-reclaim mm/vmscan.c
--- a/mm/vmscan.c~never-demote-for-memcg-reclaim	2020-06-29 16:34:44.018312591 -0700
+++ b/mm/vmscan.c	2020-06-29 16:34:44.023312591 -0700
@@ -289,7 +289,8 @@ static bool writeback_throttling_sane(st
 #endif
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
-					  int node_id)
+					  int node_id,
+					  struct scan_control *sc)
 {
 	/* Always age anon pages when we have swap */
 	if (memcg == NULL) {
@@ -300,8 +301,14 @@ static inline bool can_reclaim_anon_page
 			return true;
 	}
 
-	/* Also age anon pages if we can auto-migrate them */
-	if (next_demotion_node(node_id) >= 0)
+	/*
+	 * Also age anon pages if we can auto-migrate them.
+	 *
+	 * Migrating a page does not reduce consumption of a
+	 * memcg so should not be performed when in memcg
+	 * reclaim.
+	 */
+	if ((!sc || !cgroup_reclaim(sc)) && (next_demotion_node(node_id) >= 0))
 		return true;
 
 	/* No way to reclaim anon pages */
@@ -319,7 +326,7 @@ unsigned long zone_reclaimable_pages(str
 
 	nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
 		zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
-	if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
+	if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
 		nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
 			zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
 
@@ -1084,6 +1091,32 @@ static void page_check_dirty_writeback(s
 		mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
 }
 
+
+static int shrink_do_demote_mapping(struct page *page,
+				    struct list_head *page_list,
+				    struct scan_control *sc)
+{
+	int rc;
+
+	/* It is pointless to do demotion in memcg reclaim */
+	if (cgroup_reclaim(sc))
+		return -ENOTSUPP;
+
+	rc = migrate_demote_mapping(page);
+	/*
+	 * -ENOMEM on a THP may indicate either migration is
+	 * unsupported or there was not enough contiguous
+	 * space. Split the THP into base pages and retry the
+	 * head immediately. The tail pages will be considered
+	 * individually within the current loop's page list.
+	 */
+	if (rc == -ENOMEM && PageTransHuge(page) &&
+	    !split_huge_page_to_list(page, page_list))
+		rc = migrate_demote_mapping(page);
+
+	return rc;
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -1251,17 +1284,7 @@ static unsigned long shrink_page_list(st
 			; /* try to reclaim the page below */
 		}
 
-		rc = migrate_demote_mapping(page);
-		/*
-		 * -ENOMEM on a THP may indicate either migration is
-		 * unsupported or there was not enough contiguous
-		 * space. Split the THP into base pages and retry the
-		 * head immediately. The tail pages will be considered
-		 * individually within the current loop's page list.
-		 */
-		if (rc == -ENOMEM && PageTransHuge(page) &&
-		    !split_huge_page_to_list(page, page_list))
-			rc = migrate_demote_mapping(page);
+		rc = shrink_do_demote_mapping(page, page_list, sc);
 
 		if (rc == MIGRATEPAGE_SUCCESS) {
 			unlock_page(page);
@@ -2287,7 +2310,7 @@ static void get_scan_count(struct lruvec
 	enum lru_list lru;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
-	if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
+	if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2593,7 +2616,7 @@ static void shrink_lruvec(struct lruvec
 	 * rebalance the anon lru active/inactive ratio.
 	 */
 	if (can_reclaim_anon_pages(lruvec_memcg(lruvec),
-			       lruvec_pgdat(lruvec)->node_id) &&
+			       lruvec_pgdat(lruvec)->node_id, sc) &&
 	    inactive_is_low(lruvec, LRU_INACTIVE_ANON))
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
@@ -2664,7 +2687,7 @@ static inline bool should_continue_recla
 	 */
 	pages_for_compaction = compact_gap(sc->order);
 	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-	if (can_reclaim_anon_pages(NULL, pgdat->node_id))
+	if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
 		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 
 	return inactive_lru_pages > pages_for_compaction;
@@ -3417,7 +3440,7 @@ static void age_active_anon(struct pglis
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
-	if (!can_reclaim_anon_pages(NULL, pgdat->node_id))
+	if (!can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
 		return;
 
 	lruvec = mem_cgroup_lruvec(NULL, pgdat);
_


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration
  2020-06-29 23:45 [RFC][PATCH 0/8] Migrate Pages in lieu of discard Dave Hansen
                   ` (6 preceding siblings ...)
  2020-06-29 23:45 ` [RFC][PATCH 7/8] mm/vmscan: never demote for memcg reclaim Dave Hansen
@ 2020-06-29 23:45 ` Dave Hansen
  2020-06-30  7:23   ` Huang, Ying
  2020-07-03  9:30   ` Huang, Ying
  2020-06-30 18:36 ` [RFC][PATCH 0/8] Migrate Pages in lieu of discard Shakeel Butt
  8 siblings, 2 replies; 43+ messages in thread
From: Dave Hansen @ 2020-06-29 23:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, yang.shi, rientjes, ying.huang, dan.j.williams


From: Dave Hansen <dave.hansen@linux.intel.com>

Some method is obviously needed to enable reclaim-based migration.

Just like traditional autonuma, there will be some workloads that
will benefit like workloads with more "static" configurations where
hot pages stay hot and cold pages stay cold.  If pages come and go
from the hot and cold sets, the benefits of this approach will be
more limited.

The benefits are truly workload-based and *not* hardware-based.
We do not believe that there is a viable threshold where certain
hardware configurations should have this mechanism enabled while
others do not.

To be conservative, earlier work defaulted to disable reclaim-
based migration and did not include a mechanism to enable it.
This proposes extending the existing "zone_reclaim_mode" (now
really node_reclaim_mode) as a method to enable it.

We are open to any alternative that allows end users to enable
this mechanism or disable it if workload harm is detected (just
like traditional autonuma).

The implementation here is pretty simple and entirely unoptimized.
On any memory hotplug events, assume that a node was added or
removed and recalculate all migration targets.  This ensures that
the node_demotion[] array is always ready to be used in case the
new reclaim mode is enabled.  This recalculation is far from
optimal, most glaringly that it does not even attempt to figure
out if nodes are actually coming or going.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---

 b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
 b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
 b/mm/vmscan.c                             |    7 +--
 3 files changed, 73 insertions(+), 4 deletions(-)

diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
+++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
@@ -941,6 +941,7 @@ This is value OR'ed together of
 1	(bit currently ignored)
 2	Zone reclaim writes dirty pages out
 4	Zone reclaim swaps pages
+8	Zone reclaim migrates pages
 =	===================================
 
 zone_reclaim_mode is disabled by default.  For file servers or workloads
@@ -965,3 +966,11 @@ of other processes running on other node
 Allowing regular swap effectively restricts allocations to the local
 node unless explicitly overridden by memory policies or cpuset
 configurations.
+
+Page migration during reclaim is intended for systems with tiered memory
+configurations.  These systems have multiple types of memory with varied
+performance characteristics instead of plain NUMA systems where the same
+kind of memory is found at varied distances.  Allowing page migration
+during reclaim enables these systems to migrate pages from fast tiers to
+slow tiers when the fast tier is under pressure.  This migration is
+performed before swap.
diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
--- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
+++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
@@ -49,6 +49,7 @@
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
 #include <linux/oom.h>
+#include <linux/memory.h>
 
 #include <asm/tlbflush.h>
 
@@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
 	 * Avoid any oddities like cycles that could occur
 	 * from changes in the topology.  This will leave
 	 * a momentary gap when migration is disabled.
+	 *
+	 * This is superfluous for memory offlining since
+	 * MEM_GOING_OFFLINE does it independently, but it
+	 * does not hurt to do it a second time.
 	 */
 	disable_all_migrate_targets();
 
@@ -3211,6 +3216,60 @@ again:
 	/* Is another pass necessary? */
 	if (!nodes_empty(next_pass))
 		goto again;
+}
 
-	put_online_mems();
+/*
+ * React to hotplug events that might online or offline
+ * NUMA nodes.
+ *
+ * This leaves migrate-on-reclaim transiently disabled
+ * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
+ * This runs whether RECLAIM_MIGRATE is enabled or not.
+ * That ensures that the user can turn on RECLAIM_MIGRATE
+ * without needing to recalculate migration targets.
+ */
+#if defined(CONFIG_MEMORY_HOTPLUG)
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+						 unsigned long action, void *arg)
+{
+	switch (action) {
+	case MEM_GOING_OFFLINE:
+		/*
+		 * Make sure there are not transient states where
+		 * an offline node is a migration target.  This
+		 * will leave migration disabled until the offline
+		 * completes and the MEM_OFFLINE case below runs.
+		 */
+		disable_all_migrate_targets();
+		break;
+	case MEM_OFFLINE:
+	case MEM_ONLINE:
+		/*
+		 * Recalculate the target nodes once the node
+		 * reaches its final state (online or offline).
+		 */
+		set_migration_target_nodes();
+		break;
+	case MEM_CANCEL_OFFLINE:
+		/*
+		 * MEM_GOING_OFFLINE disabled all the migration
+		 * targets.  Reenable them.
+		 */
+		set_migration_target_nodes();
+		break;
+	case MEM_GOING_ONLINE:
+	case MEM_CANCEL_ONLINE:
+		break;
+	}
+
+	return notifier_from_errno(0);
 }
+
+static int __init migrate_on_reclaim_init(void)
+{
+	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
+	return 0;
+}
+late_initcall(migrate_on_reclaim_init);
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
--- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
+++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
@@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
  * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
  * ABI.  New bits are OK, but existing bits can never change.
  */
-#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
-#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
+#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
+#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
+#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
 
 /*
  * Priority for NODE_RECLAIM. This determines the fraction of pages
_


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration
  2020-06-29 23:45 ` [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration Dave Hansen
@ 2020-06-30  7:23   ` Huang, Ying
  2020-06-30 17:50     ` Yang Shi
  2020-07-03  9:30   ` Huang, Ying
  1 sibling, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2020-06-30  7:23 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, yang.shi, rientjes, dan.j.williams

Hi, Dave,

Dave Hansen <dave.hansen@linux.intel.com> writes:

> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Some method is obviously needed to enable reclaim-based migration.
>
> Just like traditional autonuma, there will be some workloads that
> will benefit like workloads with more "static" configurations where
> hot pages stay hot and cold pages stay cold.  If pages come and go
> from the hot and cold sets, the benefits of this approach will be
> more limited.
>
> The benefits are truly workload-based and *not* hardware-based.
> We do not believe that there is a viable threshold where certain
> hardware configurations should have this mechanism enabled while
> others do not.
>
> To be conservative, earlier work defaulted to disable reclaim-
> based migration and did not include a mechanism to enable it.
> This proposes extending the existing "zone_reclaim_mode" (now
> really node_reclaim_mode) as a method to enable it.
>
> We are open to any alternative that allows end users to enable
> this mechanism or disable it if workload harm is detected (just
> like traditional autonuma).
>
> The implementation here is pretty simple and entirely unoptimized.
> On any memory hotplug events, assume that a node was added or
> removed and recalculate all migration targets.  This ensures that
> the node_demotion[] array is always ready to be used in case the
> new reclaim mode is enabled.  This recalculation is far from
> optimal, most glaringly that it does not even attempt to figure
> out if nodes are actually coming or going.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
>
>  b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
>  b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
>  b/mm/vmscan.c                             |    7 +--
>  3 files changed, 73 insertions(+), 4 deletions(-)
>
> diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
> --- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
> +++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
> @@ -941,6 +941,7 @@ This is value OR'ed together of
>  1	(bit currently ignored)
>  2	Zone reclaim writes dirty pages out
>  4	Zone reclaim swaps pages
> +8	Zone reclaim migrates pages
>  =	===================================
>  
>  zone_reclaim_mode is disabled by default.  For file servers or workloads
> @@ -965,3 +966,11 @@ of other processes running on other node
>  Allowing regular swap effectively restricts allocations to the local
>  node unless explicitly overridden by memory policies or cpuset
>  configurations.
> +
> +Page migration during reclaim is intended for systems with tiered memory
> +configurations.  These systems have multiple types of memory with varied
> +performance characteristics instead of plain NUMA systems where the same
> +kind of memory is found at varied distances.  Allowing page migration
> +during reclaim enables these systems to migrate pages from fast tiers to
> +slow tiers when the fast tier is under pressure.  This migration is
> +performed before swap.
> diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
> --- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
> +++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
> @@ -49,6 +49,7 @@
>  #include <linux/sched/mm.h>
>  #include <linux/ptrace.h>
>  #include <linux/oom.h>
> +#include <linux/memory.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
>  	 * Avoid any oddities like cycles that could occur
>  	 * from changes in the topology.  This will leave
>  	 * a momentary gap when migration is disabled.
> +	 *
> +	 * This is superfluous for memory offlining since
> +	 * MEM_GOING_OFFLINE does it independently, but it
> +	 * does not hurt to do it a second time.
>  	 */
>  	disable_all_migrate_targets();
>  
> @@ -3211,6 +3216,60 @@ again:
>  	/* Is another pass necessary? */
>  	if (!nodes_empty(next_pass))
>  		goto again;
> +}
>  
> -	put_online_mems();
> +/*
> + * React to hotplug events that might online or offline
> + * NUMA nodes.
> + *
> + * This leaves migrate-on-reclaim transiently disabled
> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
> + * This runs whether RECLAIM_MIGRATE is enabled or not.
> + * That ensures that the user can turn on RECLAIM_MIGRATE
> + * without needing to recalculate migration targets.
> + */
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> +						 unsigned long action, void *arg)
> +{
> +	switch (action) {
> +	case MEM_GOING_OFFLINE:
> +		/*
> +		 * Make sure there are not transient states where
> +		 * an offline node is a migration target.  This
> +		 * will leave migration disabled until the offline
> +		 * completes and the MEM_OFFLINE case below runs.
> +		 */
> +		disable_all_migrate_targets();
> +		break;
> +	case MEM_OFFLINE:
> +	case MEM_ONLINE:
> +		/*
> +		 * Recalculate the target nodes once the node
> +		 * reaches its final state (online or offline).
> +		 */
> +		set_migration_target_nodes();
> +		break;
> +	case MEM_CANCEL_OFFLINE:
> +		/*
> +		 * MEM_GOING_OFFLINE disabled all the migration
> +		 * targets.  Reenable them.
> +		 */
> +		set_migration_target_nodes();
> +		break;
> +	case MEM_GOING_ONLINE:
> +	case MEM_CANCEL_ONLINE:
> +		break;
> +	}
> +
> +	return notifier_from_errno(0);
>  }
> +
> +static int __init migrate_on_reclaim_init(void)
> +{
> +	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
> +	return 0;
> +}
> +late_initcall(migrate_on_reclaim_init);
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> +
> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>   * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>   * ABI.  New bits are OK, but existing bits can never change.
>   */
> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
> +#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
> +#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
> +#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
> +#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
>  
>  /*
>   * Priority for NODE_RECLAIM. This determines the fraction of pages

I found that RECLAIM_MIGRATE is defined but never referenced in the
patch.

If my understanding of the code is correct, shrink_do_demote_mapping()
is called by shrink_page_list(), which is used by kswapd and direct
reclaim.  So as long as the persistent memory node is onlined,
reclaim-based migration will be enabled regardless of node reclaim mode.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 5/8] mm/numa: automatically generate node migration order
  2020-06-29 23:45 ` [RFC][PATCH 5/8] mm/numa: automatically generate node migration order Dave Hansen
@ 2020-06-30  8:22   ` Huang, Ying
  2020-07-01 18:23     ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2020-06-30  8:22 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, yang.shi, rientjes, dan.j.williams

Dave Hansen <dave.hansen@linux.intel.com> writes:

> +/*
> + * Find an automatic demotion target for 'node'.
> + * Failing here is OK.  It might just indicate
> + * being at the end of a chain.
> + */
> +static int establish_migrate_target(int node, nodemask_t *used)
> +{
> +	int migration_target;
> +
> +	/*
> +	 * Can not set a migration target on a
> +	 * node with it already set.
> +	 *
> +	 * No need for READ_ONCE() here since this
> +	 * is in the write path for node_demotion[].
> +	 * This should be the only thread writing.
> +	 */
> +	if (node_demotion[node] != NUMA_NO_NODE)
> +		return NUMA_NO_NODE;
> +
> +	migration_target = find_next_best_node(node, used);
> +	if (migration_target == NUMA_NO_NODE)
> +		return NUMA_NO_NODE;
> +
> +	node_demotion[node] = migration_target;
> +
> +	return migration_target;
> +}
> +
> +/*
> + * When memory fills up on a node, memory contents can be
> + * automatically migrated to another node instead of
> + * discarded at reclaim.
> + *
> + * Establish a "migration path" which will start at nodes
> + * with CPUs and will follow the priorities used to build the
> + * page allocator zonelists.
> + *
> + * The difference here is that cycles must be avoided.  If
> + * node0 migrates to node1, then neither node1, nor anything
> + * node1 migrates to can migrate to node0.
> + *
> + * This function can run simultaneously with readers of
> + * node_demotion[].  However, it can not run simultaneously
> + * with itself.  Exclusion is provided by memory hotplug events
> + * being single-threaded.
> + */
> +void set_migration_target_nodes(void)
> +{
> +	nodemask_t next_pass = NODE_MASK_NONE;
> +	nodemask_t this_pass = NODE_MASK_NONE;
> +	nodemask_t used_targets = NODE_MASK_NONE;
> +	int node;
> +
> +	get_online_mems();
> +	/*
> +	 * Avoid any oddities like cycles that could occur
> +	 * from changes in the topology.  This will leave
> +	 * a momentary gap when migration is disabled.
> +	 */
> +	disable_all_migrate_targets();
> +
> +	/*
> +	 * Ensure that the "disable" is visible across the system.
> +	 * Readers will see either a combination of before+disable
> +	 * state or disable+after.  They will never see before and
> +	 * after state together.
> +	 *
> +	 * The before+after state together might have cycles and
> +	 * could cause readers to do things like loop until this
> +	 * function finishes.  This ensures they can only see a
> +	 * single "bad" read and would, for instance, only loop
> +	 * once.
> +	 */
> +	smp_wmb();
> +
> +	/*
> +	 * Allocations go close to CPUs, first.  Assume that
> +	 * the migration path starts at the nodes with CPUs.
> +	 */
> +	next_pass = node_states[N_CPU];
> +again:
> +	this_pass = next_pass;
> +	next_pass = NODE_MASK_NONE;
> +	/*
> +	 * To avoid cycles in the migration "graph", ensure
> +	 * that migration sources are not future targets by
> +	 * setting them in 'used_targets'.
> +	 *
> +	 * But, do this only once per pass so that multiple
> +	 * source nodes can share a target node.

establish_migrate_target() calls find_next_best_node(), which will set
target_node in used_targets.  So it seems that the nodes_or() below is
only necessary to initialize used_targets, and multiple source nodes
cannot share one target node in the current implementation.

Best Regards,
Huang, Ying

> +	 */
> +	nodes_or(used_targets, used_targets, this_pass);
> +	for_each_node_mask(node, this_pass) {
> +		int target_node = establish_migrate_target(node, &used_targets);
> +
> +		if (target_node == NUMA_NO_NODE)
> +			continue;
> +
> +		/* Visit targets from this pass in the next pass: */
> +		node_set(target_node, next_pass);
> +	}
> +	/* Is another pass necessary? */
> +	if (!nodes_empty(next_pass))
> +		goto again;
> +
> +	put_online_mems();
> +}


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration
  2020-06-30  7:23   ` Huang, Ying
@ 2020-06-30 17:50     ` Yang Shi
  2020-07-01  0:48       ` Huang, Ying
  2020-07-01 16:02       ` Dave Hansen
  0 siblings, 2 replies; 43+ messages in thread
From: Yang Shi @ 2020-06-30 17:50 UTC (permalink / raw)
  To: Huang, Ying, Dave Hansen; +Cc: linux-kernel, linux-mm, rientjes, dan.j.williams





On 6/30/20 12:23 AM, Huang, Ying wrote:
> Hi, Dave,
>
> Dave Hansen <dave.hansen@linux.intel.com> writes:
>
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> Some method is obviously needed to enable reclaim-based migration.
>>
>> Just like traditional autonuma, there will be some workloads that
>> will benefit like workloads with more "static" configurations where
>> hot pages stay hot and cold pages stay cold.  If pages come and go
>> from the hot and cold sets, the benefits of this approach will be
>> more limited.
>>
>> The benefits are truly workload-based and *not* hardware-based.
>> We do not believe that there is a viable threshold where certain
>> hardware configurations should have this mechanism enabled while
>> others do not.
>>
>> To be conservative, earlier work defaulted to disable reclaim-
>> based migration and did not include a mechanism to enable it.
>> This proposes extending the existing "zone_reclaim_mode" (now
>> really node_reclaim_mode) as a method to enable it.
>>
>> We are open to any alternative that allows end users to enable
>> this mechanism or disable it if workload harm is detected (just
>> like traditional autonuma).
>>
>> The implementation here is pretty simple and entirely unoptimized.
>> On any memory hotplug events, assume that a node was added or
>> removed and recalculate all migration targets.  This ensures that
>> the node_demotion[] array is always ready to be used in case the
>> new reclaim mode is enabled.  This recalculation is far from
>> optimal, most glaringly in that it does not even attempt to figure
>> out if nodes are actually coming or going.
>>
>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Yang Shi <yang.shi@linux.alibaba.com>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> ---
>>
>>   b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
>>   b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
>>   b/mm/vmscan.c                             |    7 +--
>>   3 files changed, 73 insertions(+), 4 deletions(-)
>>
>> diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
>> --- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
>> +++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
>> @@ -941,6 +941,7 @@ This is value OR'ed together of
>>   1	(bit currently ignored)
>>   2	Zone reclaim writes dirty pages out
>>   4	Zone reclaim swaps pages
>> +8	Zone reclaim migrates pages
>>   =	===================================
>>   
>>   zone_reclaim_mode is disabled by default.  For file servers or workloads
>> @@ -965,3 +966,11 @@ of other processes running on other node
>>   Allowing regular swap effectively restricts allocations to the local
>>   node unless explicitly overridden by memory policies or cpuset
>>   configurations.
>> +
>> +Page migration during reclaim is intended for systems with tiered memory
>> +configurations.  These systems have multiple types of memory with varied
>> +performance characteristics instead of plain NUMA systems where the same
>> +kind of memory is found at varied distances.  Allowing page migration
>> +during reclaim enables these systems to migrate pages from fast tiers to
>> +slow tiers when the fast tier is under pressure.  This migration is
>> +performed before swap.
>> diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
>> --- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
>> +++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
>> @@ -49,6 +49,7 @@
>>   #include <linux/sched/mm.h>
>>   #include <linux/ptrace.h>
>>   #include <linux/oom.h>
>> +#include <linux/memory.h>
>>   
>>   #include <asm/tlbflush.h>
>>   
>> @@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
>>   	 * Avoid any oddities like cycles that could occur
>>   	 * from changes in the topology.  This will leave
>>   	 * a momentary gap when migration is disabled.
>> +	 *
>> +	 * This is superfluous for memory offlining since
>> +	 * MEM_GOING_OFFLINE does it independently, but it
>> +	 * does not hurt to do it a second time.
>>   	 */
>>   	disable_all_migrate_targets();
>>   
>> @@ -3211,6 +3216,60 @@ again:
>>   	/* Is another pass necessary? */
>>   	if (!nodes_empty(next_pass))
>>   		goto again;
>> +}
>>   
>> -	put_online_mems();
>> +/*
>> + * React to hotplug events that might online or offline
>> + * NUMA nodes.
>> + *
>> + * This leaves migrate-on-reclaim transiently disabled
>> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
>> + * This runs whether RECLAIM_MIGRATE is enabled or not.
>> + * That ensures that the user can turn on RECLAIM_MIGRATE
>> + * without needing to recalculate migration targets.
>> + */
>> +#if defined(CONFIG_MEMORY_HOTPLUG)
>> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>> +						 unsigned long action, void *arg)
>> +{
>> +	switch (action) {
>> +	case MEM_GOING_OFFLINE:
>> +		/*
>> +		 * Make sure there are not transient states where
>> +		 * an offline node is a migration target.  This
>> +		 * will leave migration disabled until the offline
>> +		 * completes and the MEM_OFFLINE case below runs.
>> +		 */
>> +		disable_all_migrate_targets();
>> +		break;
>> +	case MEM_OFFLINE:
>> +	case MEM_ONLINE:
>> +		/*
>> +		 * Recalculate the target nodes once the node
>> +		 * reaches its final state (online or offline).
>> +		 */
>> +		set_migration_target_nodes();
>> +		break;
>> +	case MEM_CANCEL_OFFLINE:
>> +		/*
>> +		 * MEM_GOING_OFFLINE disabled all the migration
>> +		 * targets.  Reenable them.
>> +		 */
>> +		set_migration_target_nodes();
>> +		break;
>> +	case MEM_GOING_ONLINE:
>> +	case MEM_CANCEL_ONLINE:
>> +		break;
>> +	}
>> +
>> +	return notifier_from_errno(0);
>>   }
>> +
>> +static int __init migrate_on_reclaim_init(void)
>> +{
>> +	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
>> +	return 0;
>> +}
>> +late_initcall(migrate_on_reclaim_init);
>> +#endif /* CONFIG_MEMORY_HOTPLUG */
>> +
>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
>> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>    * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>    * ABI.  New bits are OK, but existing bits can never change.
>>    */
>> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
>> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
>> +#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
>> +#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
>> +#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
>> +#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
>>   
>>   /*
>>    * Priority for NODE_RECLAIM. This determines the fraction of pages
> I found that RECLAIM_MIGRATE is defined but never referenced in the
> patch.
>
> If my understanding of the code were correct, shrink_do_demote_mapping()
> is called by shrink_page_list(), which is used by kswapd and direct
> reclaim.  So as long as the persistent memory node is onlined,
> reclaim-based migration will be enabled regardless of node reclaim mode.

It looks that way according to the code.  But the intention of a new
node reclaim mode is to do migration on reclaim *only when*
RECLAIM_MIGRATE is enabled by the user.

It looks like the patch just clears the migration target node masks
when memory is offlined.

So I suppose you need to check whether node_reclaim is enabled before
doing migration in shrink_page_list(), and also make node reclaim
adopt the new mode.

Please refer to 
https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/

I copied the related chunks here:

+	if (is_demote_ok(page_to_nid(page))) {	<--- check if node reclaim is enabled
+		list_add(&page->lru, &demote_pages);
+		unlock_page(page);
+		continue;
+	}

and

@@ -4084,8 +4179,10 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.gfp_mask = current_gfp_context(gfp_mask),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
-		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
+		.may_writepage = !!((node_reclaim_mode & RECLAIM_WRITE) ||
+				    (node_reclaim_mode & RECLAIM_MIGRATE)),
+		.may_unmap = !!((node_reclaim_mode & RECLAIM_UNMAP) ||
+				(node_reclaim_mode & RECLAIM_MIGRATE)),
 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
 	};
@@ -4105,7 +4202,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;

-	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
+	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
+	    (node_reclaim_mode & RECLAIM_MIGRATE)) {
 		/*
 		 * Free memory by calling shrink node with increasing
 		 * priorities until we have enough memory freed.
@@ -4138,9 +4236,12 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	 * thrown out if the node is overallocated. So we do not reclaim
 	 * if less than a specified percentage of the node is used by
 	 * unmapped file backed pages.
+	 *
+	 * Migrate mode doesn't care the above restrictions.
 	 */
 	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
-	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages &&
+	    !(node_reclaim_mode & RECLAIM_MIGRATE))
 		return NODE_RECLAIM_FULL;

>
> Best Regards,
> Huang, Ying




* Re: [RFC][PATCH 0/8] Migrate Pages in lieu of discard
  2020-06-29 23:45 [RFC][PATCH 0/8] Migrate Pages in lieu of discard Dave Hansen
                   ` (7 preceding siblings ...)
  2020-06-29 23:45 ` [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration Dave Hansen
@ 2020-06-30 18:36 ` Shakeel Butt
  2020-06-30 18:51   ` Dave Hansen
  8 siblings, 1 reply; 43+ messages in thread
From: Shakeel Butt @ 2020-06-30 18:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: LKML, Linux MM, Yang Shi, David Rientjes, Huang Ying, Dan Williams

On Mon, Jun 29, 2020 at 4:48 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> I've been sitting on these for too long.  The main purpose of this
> post is to have a public discussion with the other folks who are
> interested in this functionality and converge on a single
> implementation.
>
> This set directly incorporates a statistics patch from Yang Shi and
> also includes one to ensure good behavior with cgroup reclaim which
> was very closely derived from this series:
>
>         https://lore.kernel.org/linux-mm/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com/
>
> Since the last post, the major changes are:
>  - Added patch to skip migration when doing cgroup reclaim
>  - Added stats patch from Yang Shi
>
> The full series is also available here:
>
>         https://github.com/hansendc/linux/tree/automigrate-20200629
>
> --
>
> We're starting to see systems with more and more kinds of memory such
> as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory.
> Today, once DRAM fills up, reclaim will start and some of the DRAM
> contents will be thrown out.  Allocations will, at some point, start
> falling over to the slower persistent memory.
>
> That has two nasty properties.  First, the newer allocations can end
> up in the slower persistent memory.  Second, reclaimed data in DRAM
> are just discarded even if there are gobs of space in persistent
> memory that could be used.
>
> This set implements a solution to these problems.  At the end of the
> reclaim process in shrink_page_list() just before the last page
> refcount is dropped, the page is migrated to persistent memory instead
> of being dropped.
>
> While I've talked about a DRAM/PMEM pairing, this approach would
> function in any environment where memory tiers exist.
>
> This is not perfect.  It "strands" pages in slower memory and never
> brings them back to fast DRAM.  Other things need to be built to
> promote hot pages back to DRAM.
>
> This is part of a larger patch set.  If you want to apply these or
> play with them, I'd suggest using the tree from here.  It includes
> autonuma-based hot page promotion back to DRAM:
>
>         http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
>
> This is also all based on an upstream mechanism that allows
> persistent memory to be onlined and used as if it were volatile:
>
>         http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
>

I have a high level question. Given a reclaim request for a set of
nodes, if there is no demotion path out of that set, should the kernel
still consider the migrations within the set of nodes? Basically
should the decision to allow migrations within a reclaim request be
taken at the node level or the reclaim request (or allocation level)?



* Re: [RFC][PATCH 0/8] Migrate Pages in lieu of discard
  2020-06-30 18:36 ` [RFC][PATCH 0/8] Migrate Pages in lieu of discard Shakeel Butt
@ 2020-06-30 18:51   ` Dave Hansen
  2020-06-30 19:25     ` Shakeel Butt
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2020-06-30 18:51 UTC (permalink / raw)
  To: Shakeel Butt, Dave Hansen
  Cc: LKML, Linux MM, Yang Shi, David Rientjes, Huang Ying, Dan Williams

On 6/30/20 11:36 AM, Shakeel Butt wrote:
>> This is part of a larger patch set.  If you want to apply these or
>> play with them, I'd suggest using the tree from here.  It includes
>> autonuma-based hot page promotion back to DRAM:
>>
>>         http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
>>
>> This is also all based on an upstream mechanism that allows
>> persistent memory to be onlined and used as if it were volatile:
>>
>>         http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
>>
> I have a high level question. Given a reclaim request for a set of
> nodes, if there is no demotion path out of that set, should the kernel
> still consider the migrations within the set of nodes? 

OK, to be specific, we're talking about a case where we've arrived at
try_to_free_pages() and, say, all of the nodes on the system are set in
sc->nodemask?  Isn't the common case that all nodes are set in
sc->nodemask?  Since there is never a demotion path out of the set of
all nodes, the common case would be that there is no demotion path out
of a reclaim node set.

If that's true, I'd say that the kernel still needs to consider
migrations even within the set.



* Re: [RFC][PATCH 0/8] Migrate Pages in lieu of discard
  2020-06-30 18:51   ` Dave Hansen
@ 2020-06-30 19:25     ` Shakeel Butt
  2020-06-30 19:31       ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Shakeel Butt @ 2020-06-30 19:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, LKML, Linux MM, Yang Shi, David Rientjes,
	Huang Ying, Dan Williams

On Tue, Jun 30, 2020 at 11:51 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/30/20 11:36 AM, Shakeel Butt wrote:
> >> This is part of a larger patch set.  If you want to apply these or
> >> play with them, I'd suggest using the tree from here.  It includes
> >> autonuma-based hot page promotion back to DRAM:
> >>
> >>         http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
> >>
> >> This is also all based on an upstream mechanism that allows
> >> persistent memory to be onlined and used as if it were volatile:
> >>
> >>         http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
> >>
> > I have a high level question. Given a reclaim request for a set of
> > nodes, if there is no demotion path out of that set, should the kernel
> > still consider the migrations within the set of nodes?
>
> OK, to be specific, we're talking about a case where we've arrived at
> try_to_free_pages()

Yes.

> and, say, all of the nodes on the system are set in
> sc->nodemask?  Isn't the common case that all nodes are set in
> sc->nodemask?

Depends on the workload but for normal users, yes.

> Since there is never a demotion path out of the set of
> all nodes, the common case would be that there is no demotion path out
> of a reclaim node set.
>
> If that's true, I'd say that the kernel still needs to consider
> migrations even within the set.

In my opinion it should be a user defined policy but I think that
discussion is orthogonal to this patch series. As I understand, this
patch series aims to add the migration-within-reclaim infrastructure,
IMO the policies, optimizations, heuristics can come later.

BTW is this proposal only for systems having multi-tiers of memory?
Can a multi-node DRAM-only system take advantage of this proposal? For
example I have a system with two DRAM nodes running two jobs
hardwalled to each node. For each job the other node is kind of
low-tier memory. If I can describe the per-job demotion paths then
these jobs can take advantage of this proposal during occasional
peaks.



* Re: [RFC][PATCH 0/8] Migrate Pages in lieu of discard
  2020-06-30 19:25     ` Shakeel Butt
@ 2020-06-30 19:31       ` Dave Hansen
  2020-07-01 14:24         ` [RFC] [PATCH " Zi Yan
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2020-06-30 19:31 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Dave Hansen, LKML, Linux MM, Yang Shi, David Rientjes,
	Huang Ying, Dan Williams

On 6/30/20 12:25 PM, Shakeel Butt wrote:
>> Since there is never a demotion path out of the set of
>> all nodes, the common case would be that there is no demotion path out
>> of a reclaim node set.
>>
>> If that's true, I'd say that the kernel still needs to consider
>> migrations even within the set.
> In my opinion it should be a user defined policy but I think that
> discussion is orthogonal to this patch series. As I understand, this
> patch series aims to add the migration-within-reclaim infrastructure,
> IMO the policies, optimizations, heuristics can come later.

Yes, this should be considered as adding the infrastructure plus one
_simple_ policy implementation which sets up migration away from nodes
with CPUs to more distant nodes without CPUs.

This simple policy will be useful for (but not limited to) volatile-use
persistent memory like Intel's Optane DIMMS.

> BTW is this proposal only for systems having multi-tiers of memory?
> Can a multi-node DRAM-only system take advantage of this proposal? For
> example I have a system with two DRAM nodes running two jobs
> hardwalled to each node. For each job the other node is kind of
> low-tier memory. If I can describe the per-job demotion paths then
> these jobs can take advantage of this proposal during occasional
> peaks.

I don't see any reason it could not work there.  There would just need
to be a way to set up a different demotion path policy than what was
done here.



* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-06-29 23:45 ` [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard Dave Hansen
@ 2020-07-01  0:47   ` David Rientjes
  2020-07-01  1:29     ` Yang Shi
                       ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: David Rientjes @ 2020-07-01  0:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, kbusch, yang.shi, ying.huang, dan.j.williams

On Mon, 29 Jun 2020, Dave Hansen wrote:

> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> If a memory node has a preferred migration path to demote cold pages,
> attempt to move those inactive pages to that migration node before
> reclaiming. This will better utilize available memory, provide a faster
> tier than swapping or discarding, and allow such pages to be reused
> immediately without IO to retrieve the data.
> 
> When handling anonymous pages, this will be considered before swap if
> enabled. Should the demotion fail for any reason, the page reclaim
> will proceed as if the demotion feature was not enabled.
> 

Thanks for sharing these patches and kick-starting the conversation, Dave.

Could this cause us to break a user's mbind() or allow a user to 
circumvent their cpuset.mems?

Because we don't have a mapping of the page back to its allocation 
context (or the process context in which it was allocated), it seems like 
both are possible.

So let's assume that migration nodes cannot be other DRAM nodes.  
Otherwise, memory pressure could be intentionally or unintentionally 
induced to migrate these pages to another node.  Do we have such a 
restriction on migration nodes?

> Some places we would like to see this used:
> 
>   1. Persistent memory being used as a slower, cheaper DRAM replacement
>   2. Remote memory-only "expansion" NUMA nodes
>   3. Resolving memory imbalances where one NUMA node is seeing more
>      allocation activity than another.  This helps keep more recent
>      allocations closer to the CPUs on the node doing the allocating.
> 

(3) is the concerning one given the above if we are to use 
migrate_demote_mapping() for DRAM node balancing.

> Yang Shi's patches used an alternative approach where to-be-discarded
> pages were collected on a separate discard list and then discarded
> as a batch with migrate_pages().  This results in simpler code and
> has all the performance advantages of batching, but has the
> disadvantage that pages which fail to migrate never get swapped.
> 
> #Signed-off-by: Keith Busch <keith.busch@intel.com>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
> 
>  b/include/linux/migrate.h        |    6 ++++
>  b/include/trace/events/migrate.h |    3 +-
>  b/mm/debug.c                     |    1 
>  b/mm/migrate.c                   |   52 +++++++++++++++++++++++++++++++++++++++
>  b/mm/vmscan.c                    |   25 ++++++++++++++++++
>  5 files changed, 86 insertions(+), 1 deletion(-)
> 
> diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
> --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.950312604 -0700
> +++ b/include/linux/migrate.h	2020-06-29 16:34:38.963312604 -0700
> @@ -25,6 +25,7 @@ enum migrate_reason {
>  	MR_MEMPOLICY_MBIND,
>  	MR_NUMA_MISPLACED,
>  	MR_CONTIG_RANGE,
> +	MR_DEMOTION,
>  	MR_TYPES
>  };
>  
> @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin
>  				  struct page *newpage, struct page *page);
>  extern int migrate_page_move_mapping(struct address_space *mapping,
>  		struct page *newpage, struct page *page, int extra_count);
> +extern int migrate_demote_mapping(struct page *page);
>  #else
>  
>  static inline void putback_movable_pages(struct list_head *l) {}
> @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move
>  	return -ENOSYS;
>  }
>  
> +static inline int migrate_demote_mapping(struct page *page)
> +{
> +	return -ENOSYS;
> +}
>  #endif /* CONFIG_MIGRATION */
>  
>  #ifdef CONFIG_COMPACTION
> diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
> --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.952312604 -0700
> +++ b/include/trace/events/migrate.h	2020-06-29 16:34:38.963312604 -0700
> @@ -20,7 +20,8 @@
>  	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
>  	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
>  	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
> -	EMe(MR_CONTIG_RANGE,	"contig_range")
> +	EM( MR_CONTIG_RANGE,	"contig_range")			\
> +	EMe(MR_DEMOTION,	"demotion")
>  
>  /*
>   * First define the enums in the above macros to be exported to userspace
> diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
> --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.954312604 -0700
> +++ b/mm/debug.c	2020-06-29 16:34:38.963312604 -0700
> @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE
>  	"mempolicy_mbind",
>  	"numa_misplaced",
>  	"cma",
> +	"demotion",
>  };
>  
>  const struct trace_print_flags pageflag_names[] = {
> diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
> --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.956312604 -0700
> +++ b/mm/migrate.c	2020-06-29 16:34:38.964312604 -0700
> @@ -1151,6 +1151,58 @@ int next_demotion_node(int node)
>  	return node;
>  }
>  
> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
> +{
> +	/*
> +	 * 'mask' targets allocation only to the desired node in the
> +	 * migration path, and fails fast if the allocation can not be
> +	 * immediately satisfied.  Reclaim is already active and heroic
> +	 * allocation efforts are unwanted.
> +	 */
> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
> +			__GFP_MOVABLE;

GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we 
actually want to kick kswapd on the pmem node?

If not, GFP_TRANSHUGE_LIGHT does a trick where it does 
GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM.  You could probably do the same 
here although the __GFP_IO and __GFP_FS would be unnecessary (but not 
harmful).

> +	struct page *newpage;
> +
> +	if (PageTransHuge(page)) {
> +		mask |= __GFP_COMP;
> +		newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
> +		if (newpage)
> +			prep_transhuge_page(newpage);
> +	} else
> +		newpage = alloc_pages_node(node, mask, 0);
> +
> +	return newpage;
> +}
> +
> +/**
> + * migrate_demote_mapping() - Migrate this page and its mappings to its
> + *                            demotion node.
> + * @page: A locked, isolated, non-huge page that should migrate to its current
> + *        node's demotion target, if available. Since this is intended to be
> + *        called during memory reclaim, all flag options are set to fail fast.
> + *
> + * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise.
> + */
> +int migrate_demote_mapping(struct page *page)
> +{
> +	int next_nid = next_demotion_node(page_to_nid(page));
> +
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	VM_BUG_ON_PAGE(PageHuge(page), page);
> +	VM_BUG_ON_PAGE(PageLRU(page), page);
> +
> +	if (next_nid == NUMA_NO_NODE)
> +		return -ENOSYS;
> +	if (PageTransHuge(page) && !thp_migration_supported())
> +		return -ENOMEM;
> +
> +	/* MIGRATE_ASYNC is the most lightweight and never blocks. */
> +	return __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
> +				page, MIGRATE_ASYNC, MR_DEMOTION);
> +}
> +
> +
>  /*
>   * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move().  Work
>   * around it.
> diff -puN mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c
> --- a/mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.959312604 -0700
> +++ b/mm/vmscan.c	2020-06-29 16:34:38.965312604 -0700
> @@ -1077,6 +1077,7 @@ static unsigned long shrink_page_list(st
>  	LIST_HEAD(free_pages);
>  	unsigned nr_reclaimed = 0;
>  	unsigned pgactivate = 0;
> +	int rc;
>  
>  	memset(stat, 0, sizeof(*stat));
>  	cond_resched();
> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
>  			; /* try to reclaim the page below */
>  		}
>  
> +		rc = migrate_demote_mapping(page);
> +		/*
> +		 * -ENOMEM on a THP may indicate either migration is
> +		 * unsupported or there was not enough contiguous
> +		 * space. Split the THP into base pages and retry the
> +		 * head immediately. The tail pages will be considered
> +		 * individually within the current loop's page list.
> +		 */
> +		if (rc == -ENOMEM && PageTransHuge(page) &&
> +		    !split_huge_page_to_list(page, page_list))
> +			rc = migrate_demote_mapping(page);
> +
> +		if (rc == MIGRATEPAGE_SUCCESS) {
> +			unlock_page(page);
> +			if (likely(put_page_testzero(page)))
> +				goto free_it;
> +			/*
> +			 * Speculative reference will free this page,
> +			 * so leave it off the LRU.
> +			 */
> +			nr_reclaimed++;

nr_reclaimed += nr_pages instead?

> +			continue;
> +		}
> +
>  		/*
>  		 * Anonymous process memory has backing store?
>  		 * Try to allocate it some swap space here.



* Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration
  2020-06-30 17:50     ` Yang Shi
@ 2020-07-01  0:48       ` Huang, Ying
  2020-07-01  1:12         ` Yang Shi
  2020-07-01 16:02       ` Dave Hansen
  1 sibling, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2020-07-01  0:48 UTC (permalink / raw)
  To: Yang Shi; +Cc: Dave Hansen, linux-kernel, linux-mm, rientjes, dan.j.williams

Hi, Yang,

Yang Shi <yang.shi@linux.alibaba.com> writes:

>>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>>> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
>>> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
>>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>>    * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>>    * ABI.  New bits are OK, but existing bits can never change.
>>>    */
>>> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
>>> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>>> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
>>> +#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
>>> +#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
>>> +#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
>>> +#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
>>>     /*
>>>    * Priority for NODE_RECLAIM. This determines the fraction of pages
>> I found that RECLAIM_MIGRATE is defined but never referenced in the
>> patch.
>>
>> If my understanding of the code is correct, shrink_do_demote_mapping()
>> is called by shrink_page_list(), which is used by kswapd and direct
>> reclaim.  So as long as the persistent memory node is onlined,
>> reclaim-based migration will be enabled regardless of node reclaim mode.
>
> It looks that way according to the code. But the intention of the new
> node reclaim mode is to do migration on reclaim *only when*
> RECLAIM_MIGRATE is enabled by the user.
>
> It looks like the patch just clears the migration target node masks if
> the memory is offlined.
>
> So I suppose you need to check whether node_reclaim is enabled before
> doing migration in shrink_page_list(), and also make node reclaim
> adopt the new mode.

But why shouldn't we migrate in kswapd and direct reclaim?  I think that
we may need a way to control it, but shouldn't disable it
unconditionally.

> Please refer to
> https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/
>

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration
  2020-07-01  0:48       ` Huang, Ying
@ 2020-07-01  1:12         ` Yang Shi
  2020-07-01  1:28           ` Huang, Ying
  0 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2020-07-01  1:12 UTC (permalink / raw)
  To: Huang, Ying; +Cc: Dave Hansen, linux-kernel, linux-mm, rientjes, dan.j.williams



On 6/30/20 5:48 PM, Huang, Ying wrote:
> Hi, Yang,
>
> Yang Shi <yang.shi@linux.alibaba.com> writes:
>
>>>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>>>> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
>>>> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
>>>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>>>     * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>>>     * ABI.  New bits are OK, but existing bits can never change.
>>>>     */
>>>> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
>>>> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>>>> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
>>>> +#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
>>>> +#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
>>>> +#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
>>>> +#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
>>>>      /*
>>>>     * Priority for NODE_RECLAIM. This determines the fraction of pages
>>> I found that RECLAIM_MIGRATE is defined but never referenced in the
>>> patch.
>>>
>>> If my understanding of the code is correct, shrink_do_demote_mapping()
>>> is called by shrink_page_list(), which is used by kswapd and direct
>>> reclaim.  So as long as the persistent memory node is onlined,
>>> reclaim-based migration will be enabled regardless of node reclaim mode.
>> It looks that way according to the code. But the intention of the new
>> node reclaim mode is to do migration on reclaim *only when*
>> RECLAIM_MIGRATE is enabled by the user.
>>
>> It looks like the patch just clears the migration target node masks if
>> the memory is offlined.
>>
>> So I suppose you need to check whether node_reclaim is enabled before
>> doing migration in shrink_page_list(), and also make node reclaim
>> adopt the new mode.
> But why shouldn't we migrate in kswapd and direct reclaim?  I think that
> we may need a way to control it, but shouldn't disable it
> unconditionally.

Let me share some background. In the past discussions on LKML and at 
last year's LSFMM, the opt-in approach was preferred since the new 
feature might not yet be stable and mature.  So the new node reclaim 
mode was suggested by both Mel and Michal. I suppose this is still a 
valid point now.

Once it is mature and stable enough, we could definitely make it the 
universally preferred and default behavior.

>
>> Please refer to
>> https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/
>>
> Best Regards,
> Huang, Ying



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration
  2020-07-01  1:12         ` Yang Shi
@ 2020-07-01  1:28           ` Huang, Ying
  0 siblings, 0 replies; 43+ messages in thread
From: Huang, Ying @ 2020-07-01  1:28 UTC (permalink / raw)
  To: Yang Shi; +Cc: Dave Hansen, linux-kernel, linux-mm, rientjes, dan.j.williams

Yang Shi <yang.shi@linux.alibaba.com> writes:

> On 6/30/20 5:48 PM, Huang, Ying wrote:
>> Hi, Yang,
>>
>> Yang Shi <yang.shi@linux.alibaba.com> writes:
>>
>>>>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>>>>> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
>>>>> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
>>>>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>>>>     * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>>>>     * ABI.  New bits are OK, but existing bits can never change.
>>>>>     */
>>>>> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
>>>>> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>>>>> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
>>>>> +#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
>>>>> +#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
>>>>> +#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
>>>>> +#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
>>>>>      /*
>>>>>     * Priority for NODE_RECLAIM. This determines the fraction of pages
>>>> I found that RECLAIM_MIGRATE is defined but never referenced in the
>>>> patch.
>>>>
>>>> If my understanding of the code is correct, shrink_do_demote_mapping()
>>>> is called by shrink_page_list(), which is used by kswapd and direct
>>>> reclaim.  So as long as the persistent memory node is onlined,
>>>> reclaim-based migration will be enabled regardless of node reclaim mode.
>>> It looks that way according to the code. But the intention of the new
>>> node reclaim mode is to do migration on reclaim *only when*
>>> RECLAIM_MIGRATE is enabled by the user.
>>>
>>> It looks like the patch just clears the migration target node masks if
>>> the memory is offlined.
>>>
>>> So I suppose you need to check whether node_reclaim is enabled before
>>> doing migration in shrink_page_list(), and also make node reclaim
>>> adopt the new mode.
>> But why shouldn't we migrate in kswapd and direct reclaim?  I think that
>> we may need a way to control it, but shouldn't disable it
>> unconditionally.
>
> Let me share some background. In the past discussions on LKML and at
> last year's LSFMM, the opt-in approach was preferred since the new
> feature might not yet be stable and mature.  So the new node reclaim
> mode was suggested by both Mel and Michal. I suppose this is still a
> valid point now.

Is there any technical reason?  I think the code isn't very complex.  If
we are really worried about stability and maturity, isn't it enough to
provide some way to enable/disable the feature, even for kswapd and
direct reclaim?

Best Regards,
Huang, Ying

> Once it is mature and stable enough, we could definitely make it the
> universally preferred and default behavior.
>
>>
>>> Please refer to
>>> https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/
>>>
>> Best Regards,
>> Huang, Ying


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01  0:47   ` David Rientjes
@ 2020-07-01  1:29     ` Yang Shi
  2020-07-01  5:41       ` David Rientjes
  2020-07-01  1:40     ` Huang, Ying
  2020-07-01 16:48     ` Dave Hansen
  2 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2020-07-01  1:29 UTC (permalink / raw)
  To: David Rientjes, Dave Hansen
  Cc: linux-kernel, linux-mm, kbusch, ying.huang, dan.j.williams



On 6/30/20 5:47 PM, David Rientjes wrote:
> On Mon, 29 Jun 2020, Dave Hansen wrote:
>
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>>
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>>
> Thanks for sharing these patches and kick-starting the conversation, Dave.
>
> Could this cause us to break a user's mbind() or allow a user to
> circumvent their cpuset.mems?
>
> Because we don't have a mapping of the page back to its allocation
> context (or the process context in which it was allocated), it seems like
> both are possible.

Yes, this could break the memory placement policy enforced by mbind and 
cpuset. I discussed this with Michal on the mailing list and tried to 
find a way to solve it, but unfortunately it is not easy, as you 
mentioned above. The memory policy and cpuset are stored in task_struct 
rather than mm_struct, so it is not easy to trace back to the 
task_struct from a page (the owner field of mm_struct might be helpful, 
but it depends on CONFIG_MEMCG and is not the preferred way).

>
> So let's assume that migration nodes cannot be other DRAM nodes.
> Otherwise, memory pressure could be intentionally or unintentionally
> induced to migrate these pages to another node.  Do we have such a
> restriction on migration nodes?
>
>> Some places we would like to see this used:
>>
>>    1. Persistent memory being used as a slower, cheaper DRAM replacement
>>    2. Remote memory-only "expansion" NUMA nodes
>>    3. Resolving memory imbalances where one NUMA node is seeing more
>>       allocation activity than another.  This helps keep more recent
>>       allocations closer to the CPUs on the node doing the allocating.
>>
> (3) is the concerning one given the above if we are to use
> migrate_demote_mapping() for DRAM node balancing.
>
>> Yang Shi's patches used an alternative approach where to-be-discarded
>> pages were collected on a separate discard list and then discarded
>> as a batch with migrate_pages().  This results in simpler code and
>> has all the performance advantages of batching, but has the
>> disadvantage that pages which fail to migrate never get swapped.
>>
>> #Signed-off-by: Keith Busch <keith.busch@intel.com>
>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Keith Busch <kbusch@kernel.org>
>> Cc: Yang Shi <yang.shi@linux.alibaba.com>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> ---
>>
>>   b/include/linux/migrate.h        |    6 ++++
>>   b/include/trace/events/migrate.h |    3 +-
>>   b/mm/debug.c                     |    1
>>   b/mm/migrate.c                   |   52 +++++++++++++++++++++++++++++++++++++++
>>   b/mm/vmscan.c                    |   25 ++++++++++++++++++
>>   5 files changed, 86 insertions(+), 1 deletion(-)
>>
>> diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
>> --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.950312604 -0700
>> +++ b/include/linux/migrate.h	2020-06-29 16:34:38.963312604 -0700
>> @@ -25,6 +25,7 @@ enum migrate_reason {
>>   	MR_MEMPOLICY_MBIND,
>>   	MR_NUMA_MISPLACED,
>>   	MR_CONTIG_RANGE,
>> +	MR_DEMOTION,
>>   	MR_TYPES
>>   };
>>   
>> @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin
>>   				  struct page *newpage, struct page *page);
>>   extern int migrate_page_move_mapping(struct address_space *mapping,
>>   		struct page *newpage, struct page *page, int extra_count);
>> +extern int migrate_demote_mapping(struct page *page);
>>   #else
>>   
>>   static inline void putback_movable_pages(struct list_head *l) {}
>> @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move
>>   	return -ENOSYS;
>>   }
>>   
>> +static inline int migrate_demote_mapping(struct page *page)
>> +{
>> +	return -ENOSYS;
>> +}
>>   #endif /* CONFIG_MIGRATION */
>>   
>>   #ifdef CONFIG_COMPACTION
>> diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
>> --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.952312604 -0700
>> +++ b/include/trace/events/migrate.h	2020-06-29 16:34:38.963312604 -0700
>> @@ -20,7 +20,8 @@
>>   	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
>>   	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
>>   	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
>> -	EMe(MR_CONTIG_RANGE,	"contig_range")
>> +	EM( MR_CONTIG_RANGE,	"contig_range")			\
>> +	EMe(MR_DEMOTION,	"demotion")
>>   
>>   /*
>>    * First define the enums in the above macros to be exported to userspace
>> diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
>> --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.954312604 -0700
>> +++ b/mm/debug.c	2020-06-29 16:34:38.963312604 -0700
>> @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE
>>   	"mempolicy_mbind",
>>   	"numa_misplaced",
>>   	"cma",
>> +	"demotion",
>>   };
>>   
>>   const struct trace_print_flags pageflag_names[] = {
>> diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
>> --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.956312604 -0700
>> +++ b/mm/migrate.c	2020-06-29 16:34:38.964312604 -0700
>> @@ -1151,6 +1151,58 @@ int next_demotion_node(int node)
>>   	return node;
>>   }
>>   
>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> +	/*
>> +	 * 'mask' targets allocation only to the desired node in the
>> +	 * migration path, and fails fast if the allocation can not be
>> +	 * immediately satisfied.  Reclaim is already active and heroic
>> +	 * allocation efforts are unwanted.
>> +	 */
>> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> +			__GFP_MOVABLE;
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we
> actually want to kick kswapd on the pmem node?
>
> If not, GFP_TRANSHUGE_LIGHT does a trick where it does
> GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM.  You could probably do the same
> here although the __GFP_IO and __GFP_FS would be unnecessary (but not
> harmful).

I'm not sure what Dave thinks about this; however, IMHO kicking kswapd 
on the pmem node would help to free memory and thus improve the 
migration success rate. In my implementation, as Dave mentioned in the 
commit log, the migration candidates are put on a separate list and 
then migrated in a batch by calling migrate_pages(). Kicking kswapd on 
pmem would help to improve the success rate since migrate_pages() will 
retry a couple of times.

Dave's implementation (as you see in this patch) does migration on a 
per-page basis; if migration fails, it will try swap. Kicking kswapd 
on pmem would also help that later migration. However, IMHO migration 
retry should still be faster than swap.

>
>> +	struct page *newpage;
>> +
>> +	if (PageTransHuge(page)) {
>> +		mask |= __GFP_COMP;
>> +		newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
>> +		if (newpage)
>> +			prep_transhuge_page(newpage);
>> +	} else
>> +		newpage = alloc_pages_node(node, mask, 0);
>> +
>> +	return newpage;
>> +}
>> +
>> +/**
>> + * migrate_demote_mapping() - Migrate this page and its mappings to its
>> + *                            demotion node.
>> + * @page: A locked, isolated, non-huge page that should migrate to its current
>> + *        node's demotion target, if available. Since this is intended to be
>> + *        called during memory reclaim, all flag options are set to fail fast.
>> + *
>> + * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise.
>> + */
>> +int migrate_demote_mapping(struct page *page)
>> +{
>> +	int next_nid = next_demotion_node(page_to_nid(page));
>> +
>> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
>> +	VM_BUG_ON_PAGE(PageHuge(page), page);
>> +	VM_BUG_ON_PAGE(PageLRU(page), page);
>> +
>> +	if (next_nid == NUMA_NO_NODE)
>> +		return -ENOSYS;
>> +	if (PageTransHuge(page) && !thp_migration_supported())
>> +		return -ENOMEM;
>> +
>> +	/* MIGRATE_ASYNC is the most light weight and never blocks.*/
>> +	return __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
>> +				page, MIGRATE_ASYNC, MR_DEMOTION);
>> +}
>> +
>> +
>>   /*
>>    * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move().  Work
>>    * around it.
>> diff -puN mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c
>> --- a/mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.959312604 -0700
>> +++ b/mm/vmscan.c	2020-06-29 16:34:38.965312604 -0700
>> @@ -1077,6 +1077,7 @@ static unsigned long shrink_page_list(st
>>   	LIST_HEAD(free_pages);
>>   	unsigned nr_reclaimed = 0;
>>   	unsigned pgactivate = 0;
>> +	int rc;
>>   
>>   	memset(stat, 0, sizeof(*stat));
>>   	cond_resched();
>> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
>>   			; /* try to reclaim the page below */
>>   		}
>>   
>> +		rc = migrate_demote_mapping(page);
>> +		/*
>> +		 * -ENOMEM on a THP may indicate either migration is
>> +		 * unsupported or there was not enough contiguous
>> +		 * space. Split the THP into base pages and retry the
>> +		 * head immediately. The tail pages will be considered
>> +		 * individually within the current loop's page list.
>> +		 */
>> +		if (rc == -ENOMEM && PageTransHuge(page) &&
>> +		    !split_huge_page_to_list(page, page_list))
>> +			rc = migrate_demote_mapping(page);
>> +
>> +		if (rc == MIGRATEPAGE_SUCCESS) {
>> +			unlock_page(page);
>> +			if (likely(put_page_testzero(page)))
>> +				goto free_it;
>> +			/*
>> +			 * Speculative reference will free this page,
>> +			 * so leave it off the LRU.
>> +			 */
>> +			nr_reclaimed++;
> nr_reclaimed += nr_pages instead?
>
>> +			continue;
>> +		}
>> +
>>   		/*
>>   		 * Anonymous process memory has backing store?
>>   		 * Try to allocate it some swap space here.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01  0:47   ` David Rientjes
  2020-07-01  1:29     ` Yang Shi
@ 2020-07-01  1:40     ` Huang, Ying
  2020-07-01 16:48     ` Dave Hansen
  2 siblings, 0 replies; 43+ messages in thread
From: Huang, Ying @ 2020-07-01  1:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Hansen, linux-kernel, linux-mm, kbusch, yang.shi, dan.j.williams

David Rientjes <rientjes@google.com> writes:

> On Mon, 29 Jun 2020, Dave Hansen wrote:
>
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>> 
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>> 
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>> 
>
> Thanks for sharing these patches and kick-starting the conversation, Dave.
>
> Could this cause us to break a user's mbind() or allow a user to 
> circumvent their cpuset.mems?
>
> Because we don't have a mapping of the page back to its allocation 
> context (or the process context in which it was allocated), it seems like 
> both are possible.

For mbind, I think we don't have enough information during reclaim to
enforce the node binding policy.  But for cpuset, if cgroup v2 (with the
unified hierarchy) is used, it's possible to get the node binding policy
via something like,

  cgroup_get_e_css(page->mem_cgroup, &cpuset_cgrp_subsys)

> So let's assume that migration nodes cannot be other DRAM nodes.  
> Otherwise, memory pressure could be intentionally or unintentionally 
> induced to migrate these pages to another node.  Do we have such a 
> restriction on migration nodes?
>
>> Some places we would like to see this used:
>> 
>>   1. Persistent memory being used as a slower, cheaper DRAM replacement
>>   2. Remote memory-only "expansion" NUMA nodes
>>   3. Resolving memory imbalances where one NUMA node is seeing more
>>      allocation activity than another.  This helps keep more recent
>>      allocations closer to the CPUs on the node doing the allocating.
>> 
>
> (3) is the concerning one given the above if we are to use 
> migrate_demote_mapping() for DRAM node balancing.
>
>> Yang Shi's patches used an alternative approach where to-be-discarded
>> pages were collected on a separate discard list and then discarded
>> as a batch with migrate_pages().  This results in simpler code and
>> has all the performance advantages of batching, but has the
>> disadvantage that pages which fail to migrate never get swapped.
>> 
>> #Signed-off-by: Keith Busch <keith.busch@intel.com>
>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Keith Busch <kbusch@kernel.org>
>> Cc: Yang Shi <yang.shi@linux.alibaba.com>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> ---
>> 
>>  b/include/linux/migrate.h        |    6 ++++
>>  b/include/trace/events/migrate.h |    3 +-
>>  b/mm/debug.c                     |    1 
>>  b/mm/migrate.c                   |   52 +++++++++++++++++++++++++++++++++++++++
>>  b/mm/vmscan.c                    |   25 ++++++++++++++++++
>>  5 files changed, 86 insertions(+), 1 deletion(-)
>> 
>> diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
>> --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.950312604 -0700
>> +++ b/include/linux/migrate.h	2020-06-29 16:34:38.963312604 -0700
>> @@ -25,6 +25,7 @@ enum migrate_reason {
>>  	MR_MEMPOLICY_MBIND,
>>  	MR_NUMA_MISPLACED,
>>  	MR_CONTIG_RANGE,
>> +	MR_DEMOTION,
>>  	MR_TYPES
>>  };
>>  
>> @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin
>>  				  struct page *newpage, struct page *page);
>>  extern int migrate_page_move_mapping(struct address_space *mapping,
>>  		struct page *newpage, struct page *page, int extra_count);
>> +extern int migrate_demote_mapping(struct page *page);
>>  #else
>>  
>>  static inline void putback_movable_pages(struct list_head *l) {}
>> @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move
>>  	return -ENOSYS;
>>  }
>>  
>> +static inline int migrate_demote_mapping(struct page *page)
>> +{
>> +	return -ENOSYS;
>> +}
>>  #endif /* CONFIG_MIGRATION */
>>  
>>  #ifdef CONFIG_COMPACTION
>> diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
>> --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.952312604 -0700
>> +++ b/include/trace/events/migrate.h	2020-06-29 16:34:38.963312604 -0700
>> @@ -20,7 +20,8 @@
>>  	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
>>  	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
>>  	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
>> -	EMe(MR_CONTIG_RANGE,	"contig_range")
>> +	EM( MR_CONTIG_RANGE,	"contig_range")			\
>> +	EMe(MR_DEMOTION,	"demotion")
>>  
>>  /*
>>   * First define the enums in the above macros to be exported to userspace
>> diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
>> --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.954312604 -0700
>> +++ b/mm/debug.c	2020-06-29 16:34:38.963312604 -0700
>> @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE
>>  	"mempolicy_mbind",
>>  	"numa_misplaced",
>>  	"cma",
>> +	"demotion",
>>  };
>>  
>>  const struct trace_print_flags pageflag_names[] = {
>> diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
>> --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.956312604 -0700
>> +++ b/mm/migrate.c	2020-06-29 16:34:38.964312604 -0700
>> @@ -1151,6 +1151,58 @@ int next_demotion_node(int node)
>>  	return node;
>>  }
>>  
>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> +	/*
>> +	 * 'mask' targets allocation only to the desired node in the
>> +	 * migration path, and fails fast if the allocation can not be
>> +	 * immediately satisfied.  Reclaim is already active and heroic
>> +	 * allocation efforts are unwanted.
>> +	 */
>> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> +			__GFP_MOVABLE;
>
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we 
> actually want to kick kswapd on the pmem node?

I think it would be a good idea to kick kswapd on the PMEM node,
because otherwise we will discard more pages on the DRAM node.  And in
general, the DRAM pages are hotter than the PMEM pages, because the cold
DRAM pages are migrated to the PMEM node.

> If not, GFP_TRANSHUGE_LIGHT does a trick where it does 
> GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM.  You could probably do the same 
> here although the __GFP_IO and __GFP_FS would be unnecessary (but not 
> harmful).
>
>> +	struct page *newpage;
>> +
>> +	if (PageTransHuge(page)) {
>> +		mask |= __GFP_COMP;
>> +		newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
>> +		if (newpage)
>> +			prep_transhuge_page(newpage);
>> +	} else
>> +		newpage = alloc_pages_node(node, mask, 0);
>> +
>> +	return newpage;
>> +}
>> +

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01  1:29     ` Yang Shi
@ 2020-07-01  5:41       ` David Rientjes
  2020-07-01  8:54         ` Huang, Ying
                           ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: David Rientjes @ 2020-07-01  5:41 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dave Hansen, linux-kernel, linux-mm, kbusch, ying.huang, dan.j.williams

On Tue, 30 Jun 2020, Yang Shi wrote:

> > > From: Dave Hansen <dave.hansen@linux.intel.com>
> > > 
> > > If a memory node has a preferred migration path to demote cold pages,
> > > attempt to move those inactive pages to that migration node before
> > > reclaiming. This will better utilize available memory, provide a faster
> > > tier than swapping or discarding, and allow such pages to be reused
> > > immediately without IO to retrieve the data.
> > > 
> > > When handling anonymous pages, this will be considered before swap if
> > > enabled. Should the demotion fail for any reason, the page reclaim
> > > will proceed as if the demotion feature was not enabled.
> > > 
> > Thanks for sharing these patches and kick-starting the conversation, Dave.
> > 
> > Could this cause us to break a user's mbind() or allow a user to
> > circumvent their cpuset.mems?
> > 
> > Because we don't have a mapping of the page back to its allocation
> > context (or the process context in which it was allocated), it seems like
> > both are possible.
> 
> Yes, this could break the memory placement policy enforced by mbind and
> cpuset. I discussed this with Michal on the mailing list and tried to find a
> way to solve it, but unfortunately it is not easy, as you mentioned above.
> The memory policy and cpuset are stored in task_struct rather than mm_struct,
> so it is not easy to trace back to the task_struct from a page (the owner
> field of mm_struct might be helpful, but it depends on CONFIG_MEMCG and is
> not the preferred way).
> 

Yeah, and Ying made a similar response to this message.

We can do this if we consider pmem not to be a separate memory tier from 
the system perspective, however, but rather the socket perspective.  In 
other words, a node can only demote to a series of exclusive pmem ranges 
and promote to the same series of ranges in reverse order.  So DRAM node 0 
can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM 
node 3 -- a pmem range cannot be demoted to, or promoted from, more than 
one DRAM node.

This naturally takes care of mbind() and cpuset.mems if we consider pmem 
just to be slower volatile memory and we don't need to deal with the 
latency concerns of cross socket migration.  A user page will never be 
demoted to a pmem range across the socket and will never be promoted to a 
different DRAM node that it doesn't have access to.

That can work with the NUMA abstraction for pmem, but it could also 
theoretically be a new memory zone instead.  If all memory living on pmem 
is migratable (the natural way that memory hotplug is done, so we can 
offline), this zone would live above ZONE_MOVABLE.  Zonelist ordering 
would determine whether we can allocate directly from this memory based on 
system config or a new gfp flag that could be set for users of a mempolicy 
that allows allocations directly from pmem.  If abstracted as a NUMA node 
instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't 
make much sense.

Kswapd would need to be enlightened for proper pgdat and pmem balancing 
but in theory it should be simpler because it only has its own node to 
manage.  Existing per-zone watermarks might be easy to use to fine tune 
the policy from userspace: the scale factor determines how much memory we 
try to keep free on DRAM for migration from pmem, for example.  We also 
wouldn't have to deal with node hotplug or updating of demotion/promotion 
node chains.

Maybe the strongest advantage of the node abstraction is the ability to 
use autonuma and migrate_pages()/move_pages() API for moving pages 
explicitly?  Mempolicies could be used for migration to "top-tier" memory, 
i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed
  2020-06-29 23:45 ` [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed Dave Hansen
@ 2020-07-01  8:47   ` Greg Thelen
  2020-07-01 14:46     ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Greg Thelen @ 2020-07-01  8:47 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: linux-mm, Dave Hansen, kbusch, yang.shi, rientjes, ying.huang,
	dan.j.williams

Dave Hansen <dave.hansen@linux.intel.com> wrote:

> From: Keith Busch <kbusch@kernel.org>
>
> Migrating pages had been allocating the new page before it was actually
> needed. Subsequent operations may still fail, which would have to handle
> cleaning up the newly allocated page when it was never used.
>
> Defer allocating the page until we are actually ready to make use of
> it, after locking the original page. This simplifies error handling,
> but should not have any functional change in behavior. This is just
> refactoring page migration so the main part can more easily be reused
> by other code.

Is there any concern that the src page is now held PG_locked over the
dst page allocation, which might wander into
reclaim/cond_resched/oom_kill?  I don't have a deadlock in mind; I'm
just wondering about the additional latency imposed on unrelated
threads that want to access the src page.

> #Signed-off-by: Keith Busch <keith.busch@intel.com>

Is commented Signed-off-by intentional?  Same applies to later patches.

> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
>
>  b/mm/migrate.c |  148 ++++++++++++++++++++++++++++-----------------------------
>  1 file changed, 75 insertions(+), 73 deletions(-)
>
> diff -puN mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed mm/migrate.c
> --- a/mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed	2020-06-29 16:34:37.896312607 -0700
> +++ b/mm/migrate.c	2020-06-29 16:34:37.900312607 -0700
> @@ -1014,56 +1014,17 @@ out:
>  	return rc;
>  }
>  
> -static int __unmap_and_move(struct page *page, struct page *newpage,
> -				int force, enum migrate_mode mode)
> +static int __unmap_and_move(new_page_t get_new_page,
> +			    free_page_t put_new_page,
> +			    unsigned long private, struct page *page,
> +			    enum migrate_mode mode,
> +			    enum migrate_reason reason)
>  {
>  	int rc = -EAGAIN;
>  	int page_was_mapped = 0;
>  	struct anon_vma *anon_vma = NULL;
>  	bool is_lru = !__PageMovable(page);
> -
> -	if (!trylock_page(page)) {
> -		if (!force || mode == MIGRATE_ASYNC)
> -			goto out;
> -
> -		/*
> -		 * It's not safe for direct compaction to call lock_page.
> -		 * For example, during page readahead pages are added locked
> -		 * to the LRU. Later, when the IO completes the pages are
> -		 * marked uptodate and unlocked. However, the queueing
> -		 * could be merging multiple pages for one bio (e.g.
> -		 * mpage_readpages). If an allocation happens for the
> -		 * second or third page, the process can end up locking
> -		 * the same page twice and deadlocking. Rather than
> -		 * trying to be clever about what pages can be locked,
> -		 * avoid the use of lock_page for direct compaction
> -		 * altogether.
> -		 */
> -		if (current->flags & PF_MEMALLOC)
> -			goto out;
> -
> -		lock_page(page);
> -	}
> -
> -	if (PageWriteback(page)) {
> -		/*
> -		 * Only in the case of a full synchronous migration is it
> -		 * necessary to wait for PageWriteback. In the async case,
> -		 * the retry loop is too short and in the sync-light case,
> -		 * the overhead of stalling is too much
> -		 */
> -		switch (mode) {
> -		case MIGRATE_SYNC:
> -		case MIGRATE_SYNC_NO_COPY:
> -			break;
> -		default:
> -			rc = -EBUSY;
> -			goto out_unlock;
> -		}
> -		if (!force)
> -			goto out_unlock;
> -		wait_on_page_writeback(page);
> -	}
> +	struct page *newpage;
>  
>  	/*
>  	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
> @@ -1082,6 +1043,12 @@ static int __unmap_and_move(struct page
>  	if (PageAnon(page) && !PageKsm(page))
>  		anon_vma = page_get_anon_vma(page);
>  
> +	newpage = get_new_page(page, private);
> +	if (!newpage) {
> +		rc = -ENOMEM;
> +		goto out;
> +	}
> +
>  	/*
>  	 * Block others from accessing the new page when we get around to
>  	 * establishing additional references. We are usually the only one
> @@ -1091,11 +1058,11 @@ static int __unmap_and_move(struct page
>  	 * This is much like races on refcount of oldpage: just don't BUG().
>  	 */
>  	if (unlikely(!trylock_page(newpage)))
> -		goto out_unlock;
> +		goto out_put;
>  
>  	if (unlikely(!is_lru)) {
>  		rc = move_to_new_page(newpage, page, mode);
> -		goto out_unlock_both;
> +		goto out_unlock;
>  	}
>  
>  	/*
> @@ -1114,7 +1081,7 @@ static int __unmap_and_move(struct page
>  		VM_BUG_ON_PAGE(PageAnon(page), page);
>  		if (page_has_private(page)) {
>  			try_to_free_buffers(page);
> -			goto out_unlock_both;
> +			goto out_unlock;
>  		}
>  	} else if (page_mapped(page)) {
>  		/* Establish migration ptes */
> @@ -1131,15 +1098,9 @@ static int __unmap_and_move(struct page
>  	if (page_was_mapped)
>  		remove_migration_ptes(page,
>  			rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);
> -
> -out_unlock_both:
> -	unlock_page(newpage);
>  out_unlock:
> -	/* Drop an anon_vma reference if we took one */
> -	if (anon_vma)
> -		put_anon_vma(anon_vma);
> -	unlock_page(page);
> -out:
> +	unlock_page(newpage);
> +out_put:
>  	/*
>  	 * If migration is successful, decrease refcount of the newpage
>  	 * which will not free the page because new page owner increased
> @@ -1150,12 +1111,20 @@ out:
>  	 * state.
>  	 */
>  	if (rc == MIGRATEPAGE_SUCCESS) {
> +		set_page_owner_migrate_reason(newpage, reason);
>  		if (unlikely(!is_lru))
>  			put_page(newpage);
>  		else
>  			putback_lru_page(newpage);
> +	} else if (put_new_page) {
> +		put_new_page(newpage, private);
> +	} else {
> +		put_page(newpage);
>  	}
> -
> +out:
> +	/* Drop an anon_vma reference if we took one */
> +	if (anon_vma)
> +		put_anon_vma(anon_vma);
>  	return rc;
>  }
>  
> @@ -1203,8 +1172,7 @@ static ICE_noinline int unmap_and_move(n
>  				   int force, enum migrate_mode mode,
>  				   enum migrate_reason reason)
>  {
> -	int rc = MIGRATEPAGE_SUCCESS;
> -	struct page *newpage = NULL;
> +	int rc = -EAGAIN;
>  
>  	if (!thp_migration_supported() && PageTransHuge(page))
>  		return -ENOMEM;
> @@ -1219,17 +1187,57 @@ static ICE_noinline int unmap_and_move(n
>  				__ClearPageIsolated(page);
>  			unlock_page(page);
>  		}
> +		rc = MIGRATEPAGE_SUCCESS;
>  		goto out;
>  	}
>  
> -	newpage = get_new_page(page, private);
> -	if (!newpage)
> -		return -ENOMEM;
> +	if (!trylock_page(page)) {
> +		if (!force || mode == MIGRATE_ASYNC)
> +			return rc;
>  
> -	rc = __unmap_and_move(page, newpage, force, mode);
> -	if (rc == MIGRATEPAGE_SUCCESS)
> -		set_page_owner_migrate_reason(newpage, reason);
> +		/*
> +		 * It's not safe for direct compaction to call lock_page.
> +		 * For example, during page readahead pages are added locked
> +		 * to the LRU. Later, when the IO completes the pages are
> +		 * marked uptodate and unlocked. However, the queueing
> +		 * could be merging multiple pages for one bio (e.g.
> +		 * mpage_readpages). If an allocation happens for the
> +		 * second or third page, the process can end up locking
> +		 * the same page twice and deadlocking. Rather than
> +		 * trying to be clever about what pages can be locked,
> +		 * avoid the use of lock_page for direct compaction
> +		 * altogether.
> +		 */
> +		if (current->flags & PF_MEMALLOC)
> +			return rc;
> +
> +		lock_page(page);
> +	}
> +
> +	if (PageWriteback(page)) {
> +		/*
> +		 * Only in the case of a full synchronous migration is it
> +		 * necessary to wait for PageWriteback. In the async case,
> +		 * the retry loop is too short and in the sync-light case,
> +		 * the overhead of stalling is too much
> +		 */
> +		switch (mode) {
> +		case MIGRATE_SYNC:
> +		case MIGRATE_SYNC_NO_COPY:
> +			break;
> +		default:
> +			rc = -EBUSY;
> +			goto out_unlock;
> +		}
> +		if (!force)
> +			goto out_unlock;
> +		wait_on_page_writeback(page);
> +	}
> +	rc = __unmap_and_move(get_new_page, put_new_page, private,
> +			      page, mode, reason);
>  
> +out_unlock:
> +	unlock_page(page);
>  out:
>  	if (rc != -EAGAIN) {
>  		/*
> @@ -1269,9 +1277,8 @@ out:
>  		if (rc != -EAGAIN) {
>  			if (likely(!__PageMovable(page))) {
>  				putback_lru_page(page);
> -				goto put_new;
> +				goto done;
>  			}
> -
>  			lock_page(page);
>  			if (PageMovable(page))
>  				putback_movable_page(page);
> @@ -1280,13 +1287,8 @@ out:
>  			unlock_page(page);
>  			put_page(page);
>  		}
> -put_new:
> -		if (put_new_page)
> -			put_new_page(newpage, private);
> -		else
> -			put_page(newpage);
>  	}
> -
> +done:
>  	return rc;
>  }
>  
> _



* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01  5:41       ` David Rientjes
@ 2020-07-01  8:54         ` Huang, Ying
  2020-07-01 18:20           ` Dave Hansen
  2020-07-01 15:15         ` Dave Hansen
  2020-07-01 17:21         ` Yang Shi
  2 siblings, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2020-07-01  8:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Yang Shi, Dave Hansen, linux-kernel, linux-mm, kbusch, dan.j.williams

David Rientjes <rientjes@google.com> writes:

> On Tue, 30 Jun 2020, Yang Shi wrote:
>
>> > > From: Dave Hansen <dave.hansen@linux.intel.com>
>> > > 
>> > > If a memory node has a preferred migration path to demote cold pages,
>> > > attempt to move those inactive pages to that migration node before
>> > > reclaiming. This will better utilize available memory, provide a faster
>> > > tier than swapping or discarding, and allow such pages to be reused
>> > > immediately without IO to retrieve the data.
>> > > 
>> > > When handling anonymous pages, this will be considered before swap if
>> > > enabled. Should the demotion fail for any reason, the page reclaim
>> > > will proceed as if the demotion feature was not enabled.
>> > > 
>> > Thanks for sharing these patches and kick-starting the conversation, Dave.
>> > 
>> > Could this cause us to break a user's mbind() or allow a user to
>> > circumvent their cpuset.mems?
>> > 
>> > Because we don't have a mapping of the page back to its allocation
>> > context (or the process context in which it was allocated), it seems like
>> > both are possible.
>> 
>> Yes, this could break the memory placement policy enforced by mbind and
>> cpuset. I discussed this with Michal on mailing list and tried to find a way
>> to solve it, but unfortunately it seems not easy as what you mentioned above.
>> The memory policy and cpuset is stored in task_struct rather than mm_struct.
>> It is not easy to trace back to task_struct from page (owner field of
>> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not
>> preferred way).
>> 
>
> Yeah, and Ying made a similar response to this message.
>
> We can do this if we consider pmem not to be a separate memory tier from 
> the system perspective, however, but rather the socket perspective.  In 
> other words, a node can only demote to a series of exclusive pmem ranges 
> and promote to the same series of ranges in reverse order.  So DRAM node 0 
> can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM 
> node 3 -- a pmem range cannot be demoted to, or promoted from, more than 
> one DRAM node.
>
> This naturally takes care of mbind() and cpuset.mems if we consider pmem 
> just to be slower volatile memory and we don't need to deal with the 
> latency concerns of cross socket migration.  A user page will never be 
> demoted to a pmem range across the socket and will never be promoted to a 
> different DRAM node that it doesn't have access to.
>
> That can work with the NUMA abstraction for pmem, but it could also 
> theoretically be a new memory zone instead.  If all memory living on pmem 
> is migratable (the natural way that memory hotplug is done, so we can 
> offline), this zone would live above ZONE_MOVABLE.  Zonelist ordering 
> would determine whether we can allocate directly from this memory based on 
> system config or a new gfp flag that could be set for users of a mempolicy 
> that allows allocations directly from pmem.  If abstracted as a NUMA node 
> instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't 
> make much sense.

Why can we not just bind the memory of the application to nodes 0, 2, and 3
via mbind() or cpuset.mems?  Then the application can allocate memory
directly from PMEM.  And if we bind the application's memory via mbind()
to node 0 only, it can allocate memory directly from DRAM only.

Best Regards,
Huang, Ying



* Re: [RFC] [PATCH 0/8] Migrate Pages in lieu of discard
  2020-06-30 19:31       ` Dave Hansen
@ 2020-07-01 14:24         ` Zi Yan
  2020-07-01 14:32           ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Zi Yan @ 2020-07-01 14:24 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Shakeel Butt, Dave Hansen, LKML, Linux MM, Yang Shi,
	David Rientjes, Huang Ying, Dan Williams



On 30 Jun 2020, at 15:31, Dave Hansen wrote:

>
>
>> BTW is this proposal only for systems having multi-tiers of memory?
>> Can a multi-node DRAM-only system take advantage of this proposal? For
>> example I have a system with two DRAM nodes running two jobs
>> hardwalled to each node. For each job the other node is kind of
>> low-tier memory. If I can describe the per-job demotion paths then
>> these jobs can take advantage of this proposal during occasional
>> peaks.
>
> I don't see any reason it could not work there.  There would just need
> to be a way to set up a different demotion path policy than what was
> done here.

We might need a different threshold (or GFP flag) for allocating new pages
on the remote node for demotion.  Otherwise, we could see scenarios like
this: two nodes in a system are almost full, and Node A tries to demote
some pages to Node B, which in turn triggers demotion from Node B back to
Node A.  We could avoid such a demotion cycle by forbidding Node A from
demoting again, swapping its pages to disk instead while Node B is
demoting to Node A, but that still yields a longer reclaim path than
having Node A swap to disk directly.  In such cases, Node A should just
swap pages to disk without bothering Node B at all.

Maybe something like a GFP_DEMOTION flag for demotion allocations, which
requires more free pages to be available in the destination node, could
avoid the situation above?


—
Best Regards,
Yan Zi



* Re: [RFC] [PATCH 0/8] Migrate Pages in lieu of discard
  2020-07-01 14:24         ` [RFC] [PATCH " Zi Yan
@ 2020-07-01 14:32           ` Dave Hansen
  0 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2020-07-01 14:32 UTC (permalink / raw)
  To: Zi Yan
  Cc: Shakeel Butt, Dave Hansen, LKML, Linux MM, Yang Shi,
	David Rientjes, Huang Ying, Dan Williams


On 7/1/20 7:24 AM, Zi Yan wrote:
> On 30 Jun 2020, at 15:31, Dave Hansen wrote:
>>> BTW is this proposal only for systems having multi-tiers of
>>> memory? Can a multi-node DRAM-only system take advantage of
>>> this proposal? For example I have a system with two DRAM nodes
>>> running two jobs hardwalled to each node. For each job the
>>> other node is kind of low-tier memory. If I can describe the
>>> per-job demotion paths then these jobs can take advantage of
>>> this proposal during occasional peaks.
>> I don't see any reason it could not work there.  There would just
>> need to be a way to set up a different demotion path policy than
>> what was done here.
> We might need a different threshold (or GFP flag) for allocating
> new pages in remote node for demotion. Otherwise, we could see
> scenarios like: two nodes in a system are almost full and Node A
> is trying to demote some pages to Node B, which triggers page
> demotion from Node B to Node A.

I've always assumed that migration cycles would be illegal since it's
so hard to guarantee forward reclaim progress with them in place.



* Re: [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed
  2020-07-01  8:47   ` Greg Thelen
@ 2020-07-01 14:46     ` Dave Hansen
  2020-07-01 18:32       ` Yang Shi
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2020-07-01 14:46 UTC (permalink / raw)
  To: Greg Thelen, Dave Hansen, linux-kernel
  Cc: linux-mm, kbusch, yang.shi, rientjes, ying.huang, dan.j.williams

On 7/1/20 1:47 AM, Greg Thelen wrote:
> Dave Hansen <dave.hansen@linux.intel.com> wrote:
>> From: Keith Busch <kbusch@kernel.org>
>> Defer allocating the page until we are actually ready to make use of
>> it, after locking the original page. This simplifies error handling,
>> but should not have any functional change in behavior. This is just
>> refactoring page migration so the main part can more easily be reused
>> by other code.
> 
> Is there any concern that the src page is now held PG_locked over the
> dst page allocation, which might wander into
> reclaim/cond_resched/oom_kill?  I don't have a deadlock in mind; I'm
> just wondering about the additional latency imposed on unrelated
> threads that want to access the src page.

It's not great.  *But*, the alternative is to toss the page contents out
and let users encounter a fault and an allocation.  They would be
subject to all the latency associated with an allocation, just at a
slightly later time.

If it's a problem it seems like it would be pretty easy to fix, at least
for non-cgroup reclaim.  We know which node we're reclaiming from and we
know if it has a demotion path, so we could proactively allocate a
single migration target page before doing the source lock_page().  That
creates some other problems, but I think it would be straightforward.

>> #Signed-off-by: Keith Busch <keith.busch@intel.com>
> 
> Is commented Signed-off-by intentional?  Same applies to later patches.

Yes, Keith is no longer at Intel, so that @intel.com mail would bounce.
 I left the @intel.com SoB so it would be clear that the code originated
from Keith while at Intel, but commented it out to avoid it being picked
up by anyone's tooling.



* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01  5:41       ` David Rientjes
  2020-07-01  8:54         ` Huang, Ying
@ 2020-07-01 15:15         ` Dave Hansen
  2020-07-01 17:21         ` Yang Shi
  2 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2020-07-01 15:15 UTC (permalink / raw)
  To: David Rientjes, Yang Shi
  Cc: Dave Hansen, linux-kernel, linux-mm, kbusch, ying.huang, dan.j.williams

On 6/30/20 10:41 PM, David Rientjes wrote:
> Maybe the strongest advantage of the node abstraction is the ability to 
> use autonuma and migrate_pages()/move_pages() API for moving pages 
> explicitly?  Mempolicies could be used for migration to "top-tier" memory, 
> i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.

I totally agree that we _could_ introduce this new memory class as a zone.

Doing it as nodes is pretty natural since the firmware today describes
both slow (versus DRAM) and fast memory as separate nodes.  It also
means that apps can get visibility into placement with existing NUMA
tooling and ABIs.  To me, those are the two strongest reasons for PMEM.

Looking to the future, I don't think the zone approach scales.  I know
folks want to build stuff within a single socket which is a mix of:

1. High-Bandwidth, on-package memory (a la MCDRAM)
2. DRAM
3. DRAM-cached PMEM (aka. "memory mode" PMEM)
4. Non-cached PMEM

Right now, #1 doesn't exist on modern platforms and #3/#4 can't be mixed
(you only get 3 _or_ 4 at once).  I'd love to provide something here
that Intel can use to build future crazy platform configurations that
don't require kernel enabling.



* Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration
  2020-06-30 17:50     ` Yang Shi
  2020-07-01  0:48       ` Huang, Ying
@ 2020-07-01 16:02       ` Dave Hansen
  1 sibling, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2020-07-01 16:02 UTC (permalink / raw)
  To: Yang Shi, Huang, Ying, Dave Hansen
  Cc: linux-kernel, linux-mm, rientjes, dan.j.williams

On 6/30/20 10:50 AM, Yang Shi wrote:
> So, I suppose you need to check if node_reclaim is enabled before doing
> migration in shrink_page_list(), and also need to make node reclaim
> adopt the new mode.
> 
> Please refer to
> https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/
> 
> I copied the related chunks here:

Thanks for those!  I'll incorporate them for the next version.



* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01  0:47   ` David Rientjes
  2020-07-01  1:29     ` Yang Shi
  2020-07-01  1:40     ` Huang, Ying
@ 2020-07-01 16:48     ` Dave Hansen
  2020-07-01 19:25       ` David Rientjes
  2 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2020-07-01 16:48 UTC (permalink / raw)
  To: David Rientjes, Dave Hansen
  Cc: linux-kernel, linux-mm, kbusch, yang.shi, ying.huang, dan.j.williams

On 6/30/20 5:47 PM, David Rientjes wrote:
> On Mon, 29 Jun 2020, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>>
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>>
> 
> Thanks for sharing these patches and kick-starting the conversation, Dave.
> 
> Could this cause us to break a user's mbind() or allow a user to 
> circumvent their cpuset.mems?

In its current form, yes.

My current rationale for this is that while it's not as deferential as
it can be to the user/kernel ABI contract, it's good *overall* behavior.
 The auto-migration only kicks in when the data is about to go away.  So
while the user's data might be slower than they like, it is *WAY* faster
than they deserve because it should be off on the disk.

> Because we don't have a mapping of the page back to its allocation 
> context (or the process context in which it was allocated), it seems like 
> both are possible.
> 
> So let's assume that migration nodes cannot be other DRAM nodes.  
> Otherwise, memory pressure could be intentionally or unintentionally 
> induced to migrate these pages to another node.  Do we have such a 
> restriction on migration nodes?

There's nothing explicit.  On a normal, balanced system where there's a
1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's
implicit since the migration path is one deep and goes from DRAM->PMEM.

If there were some oddball system with a memory-only DRAM node, it
might very well end up being a migration target.

>> Some places we would like to see this used:
>>
>>   1. Persistent memory being used as a slower, cheaper DRAM replacement
>>   2. Remote memory-only "expansion" NUMA nodes
>>   3. Resolving memory imbalances where one NUMA node is seeing more
>>      allocation activity than another.  This helps keep more recent
>>      allocations closer to the CPUs on the node doing the allocating.
> 
> (3) is the concerning one given the above if we are to use 
> migrate_demote_mapping() for DRAM node balancing.

Yeah, agreed.  That's the sketchiest of the three.  :)

>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> +	/*
>> +	 * 'mask' targets allocation only to the desired node in the
>> +	 * migration path, and fails fast if the allocation can not be
>> +	 * immediately satisfied.  Reclaim is already active and heroic
>> +	 * allocation efforts are unwanted.
>> +	 */
>> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> +			__GFP_MOVABLE;
> 
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we 
> actually want to kick kswapd on the pmem node?

In my mental model, cold data flows from:

	DRAM -> PMEM -> swap

Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations
for kinda cold data, kswapd can be working on doing the PMEM->swap part
on really cold data.

...
>> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
>>  			; /* try to reclaim the page below */
>>  		}
>>  
>> +		rc = migrate_demote_mapping(page);
>> +		/*
>> +		 * -ENOMEM on a THP may indicate either migration is
>> +		 * unsupported or there was not enough contiguous
>> +		 * space. Split the THP into base pages and retry the
>> +		 * head immediately. The tail pages will be considered
>> +		 * individually within the current loop's page list.
>> +		 */
>> +		if (rc == -ENOMEM && PageTransHuge(page) &&
>> +		    !split_huge_page_to_list(page, page_list))
>> +			rc = migrate_demote_mapping(page);
>> +
>> +		if (rc == MIGRATEPAGE_SUCCESS) {
>> +			unlock_page(page);
>> +			if (likely(put_page_testzero(page)))
>> +				goto free_it;
>> +			/*
>> +			 * Speculative reference will free this page,
>> +			 * so leave it off the LRU.
>> +			 */
>> +			nr_reclaimed++;
> 
> nr_reclaimed += nr_pages instead?

Oh, good catch.  I also need to go double-check that 'nr_pages' isn't
wrong elsewhere because of the split.



* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01  5:41       ` David Rientjes
  2020-07-01  8:54         ` Huang, Ying
  2020-07-01 15:15         ` Dave Hansen
@ 2020-07-01 17:21         ` Yang Shi
  2020-07-01 19:45           ` David Rientjes
  2 siblings, 1 reply; 43+ messages in thread
From: Yang Shi @ 2020-07-01 17:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Hansen, linux-kernel, linux-mm, kbusch, ying.huang, dan.j.williams



On 6/30/20 10:41 PM, David Rientjes wrote:
> On Tue, 30 Jun 2020, Yang Shi wrote:
>
>>>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>>>
>>>> If a memory node has a preferred migration path to demote cold pages,
>>>> attempt to move those inactive pages to that migration node before
>>>> reclaiming. This will better utilize available memory, provide a faster
>>>> tier than swapping or discarding, and allow such pages to be reused
>>>> immediately without IO to retrieve the data.
>>>>
>>>> When handling anonymous pages, this will be considered before swap if
>>>> enabled. Should the demotion fail for any reason, the page reclaim
>>>> will proceed as if the demotion feature was not enabled.
>>>>
>>> Thanks for sharing these patches and kick-starting the conversation, Dave.
>>>
>>> Could this cause us to break a user's mbind() or allow a user to
>>> circumvent their cpuset.mems?
>>>
>>> Because we don't have a mapping of the page back to its allocation
>>> context (or the process context in which it was allocated), it seems like
>>> both are possible.
>> Yes, this could break the memory placement policy enforced by mbind and
>> cpuset. I discussed this with Michal on mailing list and tried to find a way
>> to solve it, but unfortunately it seems not easy as what you mentioned above.
>> The memory policy and cpuset is stored in task_struct rather than mm_struct.
>> It is not easy to trace back to task_struct from page (owner field of
>> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not
>> preferred way).
>>
> Yeah, and Ying made a similar response to this message.
>
> We can do this if we consider pmem not to be a separate memory tier from
> the system perspective, however, but rather the socket perspective.  In
> other words, a node can only demote to a series of exclusive pmem ranges
> and promote to the same series of ranges in reverse order.  So DRAM node 0
> can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> one DRAM node.
>
> This naturally takes care of mbind() and cpuset.mems if we consider pmem
> just to be slower volatile memory and we don't need to deal with the
> latency concerns of cross socket migration.  A user page will never be
> demoted to a pmem range across the socket and will never be promoted to a
> different DRAM node that it doesn't have access to.

But I don't see much benefit in limiting the migration target to the 
so-called *paired* pmem node.  IMHO it is fine to migrate to a remote 
(on a different socket) pmem node, since even cross-socket access should 
be much faster than a refault or swap-in from disk.

>
> That can work with the NUMA abstraction for pmem, but it could also
> theoretically be a new memory zone instead.  If all memory living on pmem
> is migratable (the natural way that memory hotplug is done, so we can
> offline), this zone would live above ZONE_MOVABLE.  Zonelist ordering
> would determine whether we can allocate directly from this memory based on
> system config or a new gfp flag that could be set for users of a mempolicy
> that allows allocations directly from pmem.  If abstracted as a NUMA node
> instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't
> make much sense.
>
> Kswapd would need to be enlightened for proper pgdat and pmem balancing
> but in theory it should be simpler because it only has its own node to
> manage.  Existing per-zone watermarks might be easy to use to fine tune
> the policy from userspace: the scale factor determines how much memory we
> try to keep free on DRAM for migration from pmem, for example.  We also
> wouldn't have to deal with node hotplug or updating of demotion/promotion
> node chains.
>
> Maybe the strongest advantage of the node abstraction is the ability to
> use autonuma and migrate_pages()/move_pages() API for moving pages
> explicitly?  Mempolicies could be used for migration to "top-tier" memory,
> i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.

I think using pmem as a node is more natural than a zone, and less 
intrusive, since we can just reuse all the NUMA APIs.  If we treat pmem 
as a new zone, I think the implementation would be more intrusive and 
complicated (e.g. it needs a new gfp flag) and users couldn't control 
memory placement.

Actually, there was such a proposal before; please see 
https://www.spinics.net/lists/linux-mm/msg151788.html




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01  8:54         ` Huang, Ying
@ 2020-07-01 18:20           ` Dave Hansen
  2020-07-01 19:50             ` David Rientjes
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2020-07-01 18:20 UTC (permalink / raw)
  To: Huang, Ying, David Rientjes
  Cc: Yang Shi, Dave Hansen, linux-kernel, linux-mm, kbusch, dan.j.williams

On 7/1/20 1:54 AM, Huang, Ying wrote:
> Why can not we just bind the memory of the application to node 0, 2, 3
> via mbind() or cpuset.mems?  Then the application can allocate memory
> directly from PMEM.  And if we bind the memory of the application via
> mbind() to node 0, we can only allocate memory directly from DRAM.

Applications use cpuset.mems precisely because they don't want to
allocate directly from PMEM.  They want the good, deterministic,
performance they get from DRAM.

Even if they don't allocate directly from PMEM, is it OK for such an app
to get its cold data migrated to PMEM?  That's a much more subtle
question and I suspect the kernel isn't going to have a single answer
for it.  I suspect we'll need a cpuset-level knob to turn auto-demotion
on or off.



* Re: [RFC][PATCH 5/8] mm/numa: automatically generate node migration order
  2020-06-30  8:22   ` Huang, Ying
@ 2020-07-01 18:23     ` Dave Hansen
  2020-07-02  1:20       ` Huang, Ying
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2020-07-01 18:23 UTC (permalink / raw)
  To: Huang, Ying, Dave Hansen
  Cc: linux-kernel, linux-mm, yang.shi, rientjes, dan.j.williams

On 6/30/20 1:22 AM, Huang, Ying wrote:
>> +	/*
>> +	 * To avoid cycles in the migration "graph", ensure
>> +	 * that migration sources are not future targets by
>> +	 * setting them in 'used_targets'.
>> +	 *
>> +	 * But, do this only once per pass so that multiple
>> +	 * source nodes can share a target node.
> establish_migrate_target() calls find_next_best_node(), which will set
> target_node in used_targets.  So it seems that the nodes_or() below is
> only necessary to initialize used_targets, and multiple source nodes
> cannot share one target node in current implementation.

Yes, that is true.  My focus on this implementation was simplicity and
sanity for common configurations.  I can certainly imagine scenarios
where this is suboptimal.

I'm totally open to other ways of doing this.



* Re: [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed
  2020-07-01 14:46     ` Dave Hansen
@ 2020-07-01 18:32       ` Yang Shi
  0 siblings, 0 replies; 43+ messages in thread
From: Yang Shi @ 2020-07-01 18:32 UTC (permalink / raw)
  To: Dave Hansen, Greg Thelen, Dave Hansen, linux-kernel
  Cc: linux-mm, kbusch, rientjes, ying.huang, dan.j.williams



On 7/1/20 7:46 AM, Dave Hansen wrote:
> On 7/1/20 1:47 AM, Greg Thelen wrote:
>> Dave Hansen <dave.hansen@linux.intel.com> wrote:
>>> From: Keith Busch <kbusch@kernel.org>
>>> Defer allocating the page until we are actually ready to make use of
>>> it, after locking the original page. This simplifies error handling,
>>> but should not have any functional change in behavior. This is just
>>> refactoring page migration so the main part can more easily be reused
>>> by other code.
>> Is there any concern that the src page is now held PG_locked over the
>> dst page allocation, which might wander into
>> reclaim/cond_resched/oom_kill?  I don't have a deadlock in mind.  I'm
>> just wondering about the additional latency imposed on unrelated threads
>> who want access src page.
> It's not great.  *But*, the alternative is to toss the page contents out
> and let users encounter a fault and an allocation.  They would be
> subject to all the latency associated with an allocation, just at a
> slightly later time.
>
> If it's a problem it seems like it would be pretty easy to fix, at least
> for non-cgroup reclaim.  We know which node we're reclaiming from and we
> know if it has a demotion path, so we could proactively allocate a
> single migration target page before doing the source lock_page().  That
> creates some other problems, but I think it would be straightforward.

If so, this patch looks pointless, if I read it correctly. The patch 
defers the page allocation in __unmap_and_move() until the page lock is 
held so that __unmap_and_move() can be called from the reclaim path: the 
src page is already locked in the reclaim path before 
__unmap_and_move() is called, and locking it again would deadlock.

Actually, with this implementation you always allocate the target page 
with the src page locked, unless you move the target page allocation to 
before shrink_page_list(); but the problem is that you don't know how 
many pages you need to allocate.

The alternative may be to unlock the src page, allocate the target page, 
then lock the src page again. But if so, why not just call 
migrate_pages() directly as I did in my series? It puts the src pages 
on a separate list, unlocks them, then migrates them in batch later.

>>> #Signed-off-by: Keith Busch <keith.busch@intel.com>
>> Is commented Signed-off-by intentional?  Same applies to later patches.
> Yes, Keith is no longer at Intel, so that @intel.com mail would bounce.
>   I left the @intel.com SoB so it would be clear that the code originated
> from Keith while at Intel, but commented it out to avoid it being picked
> up by anyone's tooling.




* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01 16:48     ` Dave Hansen
@ 2020-07-01 19:25       ` David Rientjes
  2020-07-02  5:02         ` Huang, Ying
  0 siblings, 1 reply; 43+ messages in thread
From: David Rientjes @ 2020-07-01 19:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, linux-kernel, linux-mm, kbusch, yang.shi,
	ying.huang, dan.j.williams

On Wed, 1 Jul 2020, Dave Hansen wrote:

> > Could this cause us to break a user's mbind() or allow a user to 
> > circumvent their cpuset.mems?
> 
> In its current form, yes.
> 
> My current rationale for this is that while it's not as deferential as
> it can be to the user/kernel ABI contract, it's good *overall* behavior.
>  The auto-migration only kicks in when the data is about to go away.  So
> while the user's data might be slower than they like, it is *WAY* faster
> than they deserve because it should be off on the disk.
> 

It's outside the scope of this patchset, but eventually there will be a 
promotion path that I think requires a strict 1:1 relationship between 
DRAM and PMEM nodes because otherwise mbind(), set_mempolicy(), and 
cpuset.mems become ineffective for nodes facing memory pressure.

For the purposes of this patchset, agreed that DRAM -> PMEM -> swap makes 
perfect sense.  Theoretically, I think you could have DRAM N0 and N1 and 
then a single PMEM N2 and this N2 can be the terminal node for both N0 and 
N1.  On promotion, I think we need to rely on something stronger than 
autonuma to decide which DRAM node to promote to: specifically any user 
policy put into effect (memory tiering or autonuma shouldn't be allowed to 
subvert these user policies).

As others have mentioned, we lose the allocation or process context at the 
time of demotion or promotion and any workaround for that requires some 
hacks, such as mapping the page to cpuset (what is the right solution for 
shared pages?) or adding NUMA locality handling to memcg.

I think a 1:1 relationship between DRAM and PMEM nodes is required if we 
consider the eventual promotion of this memory so that user memory can't 
eventually reappear on a DRAM node that is not allowed by mbind(), 
set_mempolicy(), or cpuset.mems.  I think it also makes this patchset much 
simpler.

> > Because we don't have a mapping of the page back to its allocation 
> > context (or the process context in which it was allocated), it seems like 
> > both are possible.
> > 
> > So let's assume that migration nodes cannot be other DRAM nodes.  
> > Otherwise, memory pressure could be intentionally or unintentionally 
> > induced to migrate these pages to another node.  Do we have such a 
> > restriction on migration nodes?
> 
> There's nothing explicit.  On a normal, balanced system where there's a
> 1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's
> implicit since the migration path is one deep and goes from DRAM->PMEM.
> 
> If there were some oddball system where there was a memory only DRAM
> node, it might very well end up being a migration target.
> 

Shouldn't DRAM->DRAM demotion be banned?  It's all DRAM and within the 
control of mempolicies and cpusets today, so I had assumed this is outside 
the scope of memory tiering support.  I had assumed that memory tiering 
support was all about separate tiers :)

> >> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
> >> +{
> >> +	/*
> >> +	 * 'mask' targets allocation only to the desired node in the
> >> +	 * migration path, and fails fast if the allocation can not be
> >> +	 * immediately satisfied.  Reclaim is already active and heroic
> >> +	 * allocation efforts are unwanted.
> >> +	 */
> >> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
> >> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
> >> +			__GFP_MOVABLE;
> > 
> > GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we 
> > actually want to kick kswapd on the pmem node?
> 
> In my mental model, cold data flows from:
> 
> 	DRAM -> PMEM -> swap
> 
> Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations
> for kinda cold data, kswapd can be working on doing the PMEM->swap part
> on really cold data.
> 

Makes sense.



* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01 17:21         ` Yang Shi
@ 2020-07-01 19:45           ` David Rientjes
  2020-07-02 10:02             ` Jonathan Cameron
  0 siblings, 1 reply; 43+ messages in thread
From: David Rientjes @ 2020-07-01 19:45 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dave Hansen, linux-kernel, linux-mm, kbusch, ying.huang, dan.j.williams

On Wed, 1 Jul 2020, Yang Shi wrote:

> > We can do this if we consider pmem not to be a separate memory tier from
> > the system perspective, however, but rather the socket perspective.  In
> > other words, a node can only demote to a series of exclusive pmem ranges
> > and promote to the same series of ranges in reverse order.  So DRAM node 0
> > can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> > node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> > one DRAM node.
> > 
> > This naturally takes care of mbind() and cpuset.mems if we consider pmem
> > just to be slower volatile memory and we don't need to deal with the
> > latency concerns of cross socket migration.  A user page will never be
> > demoted to a pmem range across the socket and will never be promoted to a
> > different DRAM node that it doesn't have access to.
> 
> But I don't see too much benefit to limit the migration target to the
> so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a
> different socket) pmem node since even the cross socket access should be much
> faster than refault or swap from disk.
> 

Hi Yang,

Right, but any eventual promotion path would allow this to subvert the 
user mempolicy or cpuset.mems if the demoted memory is eventually promoted 
to a DRAM node on its socket.  We've discussed not having the ability to 
map from the demoted page to either of these contexts and it becomes more 
difficult for shared memory.  We have page_to_nid() and page_zone() so we 
can always find the appropriate demotion or promotion node for a given 
page if there is a 1:1 relationship.

Do we lose anything with the strict 1:1 relationship between DRAM and PMEM 
nodes?  It seems much simpler in terms of implementation and is more 
intuitive.

> I think using pmem as a node is more natural than zone and less intrusive
> since we can just reuse all the numa APIs. If we treat pmem as a new zone I
> think the implementation may be more intrusive and complicated (i.e. need a
> new gfp flag) and user can't control the memory placement.
> 

This is an important decision to make, I'm not sure that we actually 
*want* all of these NUMA APIs :)  If my memory is demoted, I can simply do 
migrate_pages() back to DRAM and cause other memory to be demoted in its 
place.  Things like MPOL_INTERLEAVE over nodes {0,1,2} don't make sense.  
Kswapd for a DRAM node putting demotion pressure on a PMEM node, whose 
kswapd is then pressured in turn to reclaim that memory, serves *only* 
to spend unnecessary cpu cycles.

Users could control the memory placement through a new mempolicy flag, 
which I think is needed anyway for explicit allocation policies for PMEM 
nodes.  Consider if PMEM is a zone so that it has the natural 1:1 
relationship with DRAM, now your system only has nodes {0,1} as today, no 
new NUMA topology to consider, and a mempolicy flag MPOL_F_TOPTIER that 
specifies memory must be allocated from ZONE_MOVABLE or ZONE_NORMAL (and I 
can then mlock() if I want to disable demotion on memory pressure).



* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01 18:20           ` Dave Hansen
@ 2020-07-01 19:50             ` David Rientjes
  2020-07-02  1:50               ` Huang, Ying
  0 siblings, 1 reply; 43+ messages in thread
From: David Rientjes @ 2020-07-01 19:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Huang, Ying, Yang Shi, Dave Hansen, linux-kernel, linux-mm,
	kbusch, dan.j.williams

On Wed, 1 Jul 2020, Dave Hansen wrote:

> Even if they don't allocate directly from PMEM, is it OK for such an app
> to get its cold data migrated to PMEM?  That's a much more subtle
> question and I suspect the kernel isn't going to have a single answer
> for it.  I suspect we'll need a cpuset-level knob to turn auto-demotion
> on or off.
> 

I think the answer comes down to whether the app's cold data can be 
reclaimed at all; if it can, migration to PMEM is likely better in terms 
of performance.  So any such app today should just be mlocking its cold 
data if it can't handle the overhead of reclaim?



* Re: [RFC][PATCH 5/8] mm/numa: automatically generate node migration order
  2020-07-01 18:23     ` Dave Hansen
@ 2020-07-02  1:20       ` Huang, Ying
  0 siblings, 0 replies; 43+ messages in thread
From: Huang, Ying @ 2020-07-02  1:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, linux-kernel, linux-mm, yang.shi, rientjes, dan.j.williams

Dave Hansen <dave.hansen@intel.com> writes:

> On 6/30/20 1:22 AM, Huang, Ying wrote:
>>> +	/*
>>> +	 * To avoid cycles in the migration "graph", ensure
>>> +	 * that migration sources are not future targets by
>>> +	 * setting them in 'used_targets'.
>>> +	 *
>>> +	 * But, do this only once per pass so that multiple
>>> +	 * source nodes can share a target node.
>> establish_migrate_target() calls find_next_best_node(), which will set
>> target_node in used_targets.  So it seems that the nodes_or() below is
>> only necessary to initialize used_targets, and multiple source nodes
>> cannot share one target node in current implementation.
>
> Yes, that is true.  My focus on this implementation was simplicity and
> sanity for common configurations.  I can certainly imagine scenarios
> where this is suboptimal.
>
> I'm totally open to other ways of doing this.

OK.  So when we really need to share one target node among multiple
source nodes, we can add a parameter to find_next_best_node() to specify
whether to set target_node in used_targets.

Best Regards,
Huang, Ying



* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01 19:50             ` David Rientjes
@ 2020-07-02  1:50               ` Huang, Ying
  0 siblings, 0 replies; 43+ messages in thread
From: Huang, Ying @ 2020-07-02  1:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Hansen, Yang Shi, Dave Hansen, linux-kernel, linux-mm,
	kbusch, dan.j.williams

David Rientjes <rientjes@google.com> writes:

> On Wed, 1 Jul 2020, Dave Hansen wrote:
>
>> Even if they don't allocate directly from PMEM, is it OK for such an app
>> to get its cold data migrated to PMEM?  That's a much more subtle
>> question and I suspect the kernel isn't going to have a single answer
>> for it.  I suspect we'll need a cpuset-level knob to turn auto-demotion
>> on or off.
>> 
>
> I think the answer is whether the app's cold data can be reclaimed, 
> otherwise migration to PMEM is likely better in terms of performance.  So 
> any such app today should just be mlocking its cold data if it can't 
> handle overhead from reclaim?

Yes.  That's a way to solve the problem.  A cpuset-level knob may be
more flexible, because you don't need to change the application source
code.

Best Regards,
Huang, Ying



* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01 19:25       ` David Rientjes
@ 2020-07-02  5:02         ` Huang, Ying
  0 siblings, 0 replies; 43+ messages in thread
From: Huang, Ying @ 2020-07-02  5:02 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Hansen, Dave Hansen, linux-kernel, linux-mm, kbusch,
	yang.shi, dan.j.williams

David Rientjes <rientjes@google.com> writes:

> On Wed, 1 Jul 2020, Dave Hansen wrote:
>
>> > Could this cause us to break a user's mbind() or allow a user to 
>> > circumvent their cpuset.mems?
>> 
>> In its current form, yes.
>> 
>> My current rationale for this is that while it's not as deferential as
>> it can be to the user/kernel ABI contract, it's good *overall* behavior.
>>  The auto-migration only kicks in when the data is about to go away.  So
>> while the user's data might be slower than they like, it is *WAY* faster
>> than they deserve because it should be off on the disk.
>> 
>
> It's outside the scope of this patchset, but eventually there will be a 
> promotion path that I think requires a strict 1:1 relationship between 
> DRAM and PMEM nodes because otherwise mbind(), set_mempolicy(), and 
> cpuset.mems become ineffective for nodes facing memory pressure.

I have posted a patchset for AutoNUMA-based promotion support:

https://lore.kernel.org/lkml/20200218082634.1596727-1-ying.huang@intel.com/

There, the page is promoted upon a NUMA hint page fault, so all memory
policies (mbind(), set_mempolicy(), and cpuset.mems) are available.  We
can refuse to promote a page to DRAM nodes that are not allowed by any
memory policy.  So a 1:1 relationship isn't necessary for promotion.

> For the purposes of this patchset, agreed that DRAM -> PMEM -> swap makes 
> perfect sense.  Theoretically, I think you could have DRAM N0 and N1 and 
> then a single PMEM N2 and this N2 can be the terminal node for both N0 and 
> N1.  On promotion, I think we need to rely on something stronger than 
> autonuma to decide which DRAM node to promote to: specifically any user 
> policy put into effect (memory tiering or autonuma shouldn't be allowed to 
> subvert these user policies).
>
> As others have mentioned, we lose the allocation or process context at the 
> time of demotion or promotion

As above, we have process context at time of promotion.

> and any workaround for that requires some 
> hacks, such as mapping the page to cpuset (what is the right solution for 
> shared pages?) or adding NUMA locality handling to memcg.

It sounds natural to me to add a NUMA node restriction to memcg.

Best Regards,
Huang, Ying



* Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
  2020-07-01 19:45           ` David Rientjes
@ 2020-07-02 10:02             ` Jonathan Cameron
  0 siblings, 0 replies; 43+ messages in thread
From: Jonathan Cameron @ 2020-07-02 10:02 UTC (permalink / raw)
  To: David Rientjes
  Cc: Yang Shi, Dave Hansen, linux-kernel, linux-mm, kbusch,
	ying.huang, dan.j.williams

On Wed, 1 Jul 2020 12:45:17 -0700
David Rientjes <rientjes@google.com> wrote:

> On Wed, 1 Jul 2020, Yang Shi wrote:
> 
> > > We can do this if we consider pmem not to be a separate memory tier from
> > > the system perspective, however, but rather the socket perspective.  In
> > > other words, a node can only demote to a series of exclusive pmem ranges
> > > and promote to the same series of ranges in reverse order.  So DRAM node 0
> > > can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> > > node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> > > one DRAM node.
> > > 
> > > This naturally takes care of mbind() and cpuset.mems if we consider pmem
> > > just to be slower volatile memory and we don't need to deal with the
> > > latency concerns of cross socket migration.  A user page will never be
> > > demoted to a pmem range across the socket and will never be promoted to a
> > > different DRAM node that it doesn't have access to.  
> > 
> > But I don't see too much benefit to limit the migration target to the
> > so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a
> > different socket) pmem node since even the cross socket access should be much
> > faster than refault or swap from disk.
> >   
> 
> Hi Yang,
> 
> Right, but any eventual promotion path would allow this to subvert the 
> user mempolicy or cpuset.mems if the demoted memory is eventually promoted 
> to a DRAM node on its socket.  We've discussed not having the ability to 
> map from the demoted page to either of these contexts and it becomes more 
> difficult for shared memory.  We have page_to_nid() and page_zone() so we 
> can always find the appropriate demotion or promotion node for a given 
> page if there is a 1:1 relationship.
> 
> Do we lose anything with the strict 1:1 relationship between DRAM and PMEM 
> nodes?  It seems much simpler in terms of implementation and is more 
> intuitive.
Hi David, Yang,

The 1:1 mapping implies a particular system topology.  In the medium
term we are likely to see systems with a central pool of persistent
memory that has equal access characteristics from multiple
CPU-containing nodes, each with local DRAM.

Clearly we could fake a split of such a pmem pool to keep the 1:1
mapping, but it's certainly not elegant and may be very wasteful of
resources.

Can a zone based approach work well without such a hard wall?

Jonathan

> 
> > I think using pmem as a node is more natural than zone and less intrusive
> > since we can just reuse all the numa APIs. If we treat pmem as a new zone I
> > think the implementation may be more intrusive and complicated (i.e. need a
> > new gfp flag) and user can't control the memory placement.
> >   
> 
> This is an important decision to make, I'm not sure that we actually 
> *want* all of these NUMA APIs :)  If my memory is demoted, I can simply do 
> migrate_pages() back to DRAM and cause other memory to be demoted in its 
> place.  Things like MPOL_INTERLEAVE over nodes {0,1,2} don't make sense.  
> Kswapd for a DRAM node putting pressure on a PMEM node for demotion that 
> then puts the kswapd for the PMEM node under pressure to reclaim it serves 
> *only* to spend unnecessary cpu cycles.
> 
> Users could control the memory placement through a new mempolicy flag, 
> which I think are needed anyway for explicit allocation policies for PMEM 
> nodes.  Consider if PMEM is a zone so that it has the natural 1:1 
> relationship with DRAM, now your system only has nodes {0,1} as today, no 
> new NUMA topology to consider, and a mempolicy flag MPOL_F_TOPTIER that 
> specifies memory must be allocated from ZONE_MOVABLE or ZONE_NORMAL (and I 
> can then mlock() if I want to disable demotion on memory pressure).
> 





* Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration
  2020-06-29 23:45 ` [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration Dave Hansen
  2020-06-30  7:23   ` Huang, Ying
@ 2020-07-03  9:30   ` Huang, Ying
  1 sibling, 0 replies; 43+ messages in thread
From: Huang, Ying @ 2020-07-03  9:30 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, yang.shi, rientjes, dan.j.williams

Dave Hansen <dave.hansen@linux.intel.com> writes:
> +/*
> + * React to hotplug events that might online or offline
> + * NUMA nodes.
> + *
> + * This leaves migrate-on-reclaim transiently disabled
> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
> + * This runs whether RECLAIM_MIGRATE is enabled or not.
> + * That ensures that the user can turn RECLAIM_MIGRATE on
> + * and off without needing to recalculate migration targets.
> + */
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> +						 unsigned long action, void *arg)
> +{
> +	switch (action) {
> +	case MEM_GOING_OFFLINE:
> +		/*
> +		 * Make sure there are not transient states where
> +		 * an offline node is a migration target.  This
> +		 * will leave migration disabled until the offline
> +		 * completes and the MEM_OFFLINE case below runs.
> +		 */
> +		disable_all_migrate_targets();
> +		break;
> +	case MEM_OFFLINE:
> +	case MEM_ONLINE:
> +		/*
> +		 * Recalculate the target nodes once the node
> +		 * reaches its final state (online or offline).
> +		 */
> +		set_migration_target_nodes();
> +		break;
> +	case MEM_CANCEL_OFFLINE:
> +		/*
> +		 * MEM_GOING_OFFLINE disabled all the migration
> +		 * targets.  Reenable them.
> +		 */
> +		set_migration_target_nodes();
> +		break;
> +	case MEM_GOING_ONLINE:
> +	case MEM_CANCEL_ONLINE:
> +		break;

I think we need to call
disable_all_migrate_targets()/set_migration_target_nodes() for CPU
online/offline events too, because those change node_state(nid, N_CPU),
which in turn changes the node demotion relationship.

> +	}
> +
> +	return notifier_from_errno(0);
>  }
> +

Best Regards,
Huang, Ying



end of thread, back to index

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-29 23:45 [RFC][PATCH 0/8] Migrate Pages in lieu of discard Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 1/8] mm/numa: node demotion data structure and lookup Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed Dave Hansen
2020-07-01  8:47   ` Greg Thelen
2020-07-01 14:46     ` Dave Hansen
2020-07-01 18:32       ` Yang Shi
2020-06-29 23:45 ` [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard Dave Hansen
2020-07-01  0:47   ` David Rientjes
2020-07-01  1:29     ` Yang Shi
2020-07-01  5:41       ` David Rientjes
2020-07-01  8:54         ` Huang, Ying
2020-07-01 18:20           ` Dave Hansen
2020-07-01 19:50             ` David Rientjes
2020-07-02  1:50               ` Huang, Ying
2020-07-01 15:15         ` Dave Hansen
2020-07-01 17:21         ` Yang Shi
2020-07-01 19:45           ` David Rientjes
2020-07-02 10:02             ` Jonathan Cameron
2020-07-01  1:40     ` Huang, Ying
2020-07-01 16:48     ` Dave Hansen
2020-07-01 19:25       ` David Rientjes
2020-07-02  5:02         ` Huang, Ying
2020-06-29 23:45 ` [RFC][PATCH 4/8] mm/vmscan: add page demotion counter Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 5/8] mm/numa: automatically generate node migration order Dave Hansen
2020-06-30  8:22   ` Huang, Ying
2020-07-01 18:23     ` Dave Hansen
2020-07-02  1:20       ` Huang, Ying
2020-06-29 23:45 ` [RFC][PATCH 6/8] mm/vmscan: Consider anonymous pages without swap Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 7/8] mm/vmscan: never demote for memcg reclaim Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration Dave Hansen
2020-06-30  7:23   ` Huang, Ying
2020-06-30 17:50     ` Yang Shi
2020-07-01  0:48       ` Huang, Ying
2020-07-01  1:12         ` Yang Shi
2020-07-01  1:28           ` Huang, Ying
2020-07-01 16:02       ` Dave Hansen
2020-07-03  9:30   ` Huang, Ying
2020-06-30 18:36 ` [RFC][PATCH 0/8] Migrate Pages in lieu of discard Shakeel Butt
2020-06-30 18:51   ` Dave Hansen
2020-06-30 19:25     ` Shakeel Butt
2020-06-30 19:31       ` Dave Hansen
2020-07-01 14:24         ` [RFC] [PATCH " Zi Yan
2020-07-01 14:32           ` Dave Hansen
