linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
@ 2019-10-16 22:11 Dave Hansen
  2019-10-16 22:11 ` [PATCH 1/4] node: Define and export memory migration path Dave Hansen
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Dave Hansen @ 2019-10-16 22:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, dan.j.williams, Dave Hansen

We're starting to see systems with more and more kinds of memory such
as Intel's implementation of persistent memory.

Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out.  Allocations will, at some point, start
falling over to the slower persistent memory.

That has two nasty properties.  First, the newer allocations can end
up in the slower persistent memory.  Second, reclaimed data in DRAM
are just discarded even if there are gobs of space in persistent
memory that could be used.

This set implements a solution to these problems.  At the end of the
reclaim process in shrink_page_list() just before the last page
refcount is dropped, the page is migrated to persistent memory instead
of being dropped.
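
In condensed pseudo-C, the hook this series adds to shrink_page_list()
looks roughly like this (names follow the patches below; THP splitting
and error handling are elided, see patch 3 for the real hunk):

	if (!PageHuge(page) &&
	    migrate_demote_mapping(page) == MIGRATEPAGE_SUCCESS) {
		/* The data now lives on the demotion node; drop the old page. */
		unlock_page(page);
		if (likely(put_page_testzero(page)))
			goto free_it;
		/* A speculative reference will free the page for us. */
		nr_reclaimed++;
		continue;
	}
	/* Otherwise fall through to the normal swap/discard path. */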

While I've talked about a DRAM/PMEM pairing, this approach would
function in any environment where memory tiers exist.

This is not perfect.  It "strands" pages in slower memory and never
brings them back to fast DRAM.  Other things need to be built to
promote hot pages back to DRAM.

This is part of a larger patch set.  If you want to apply these or
play with them, I'd suggest using the tree from here.  It includes
autonuma-based hot page promotion back to DRAM:

	http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com

This is also all based on an upstream mechanism that allows
persistent memory to be onlined and used as if it were volatile:

	http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

* [PATCH 1/4] node: Define and export memory migration path
  2019-10-16 22:11 [PATCH 0/4] [RFC] Migrate Pages in lieu of discard Dave Hansen
@ 2019-10-16 22:11 ` Dave Hansen
  2019-10-17 11:12   ` Kirill A. Shutemov
  2019-10-16 22:11 ` [PATCH 2/4] mm/migrate: Defer allocating new page until needed Dave Hansen
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Dave Hansen @ 2019-10-16 22:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, dan.j.williams, Dave Hansen, keith.busch


From: Keith Busch <keith.busch@intel.com>

Prepare for the kernel to auto-migrate pages to other memory nodes
with a user defined node migration table. This allows creating a single
migration target for each NUMA node to enable the kernel to do NUMA
page migrations instead of simply reclaiming colder pages. A node
with no target is a "terminal node", so reclaim acts normally there.
The migration target does not fundamentally _need_ to be a single node,
but this implementation starts there to limit complexity.

If you consider the migration path as a graph, cycles (loops) in the
graph are disallowed.  This avoids wasting resources by constantly
migrating (A->B, B->A, A->B ...).  The expectation is that cycles will
never be allowed, and this rule is enforced if the user tries to make
such a cycle.
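
For illustration, a rough userspace sketch (node numbers here are
hypothetical, and the path assumes the attribute below shows up in the
usual per-node sysfs directory) that points node 0's demotions at
node 2:

	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		const char *path = "/sys/devices/system/node/node0/migration_path";
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			return EXIT_FAILURE;
		}
		/* "2" makes node 2 the demotion target; any negative value
		 * turns node 0 back into a terminal node. */
		if (fprintf(f, "2\n") < 0 || fclose(f) == EOF) {
			perror(path);
			return EXIT_FAILURE;
		}
		return EXIT_SUCCESS;
	}

A later write that would create a cycle (say, pointing node 2 back at
node 0) is rejected with -EINVAL by the store handler below.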

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/drivers/base/node.c  |   73 +++++++++++++++++++++++++++++++++++++++++++++++++
 b/include/linux/node.h |    6 ++++
 2 files changed, 79 insertions(+)

diff -puN drivers/base/node.c~0003-node-Define-and-export-memory-migration-path drivers/base/node.c
--- a/drivers/base/node.c~0003-node-Define-and-export-memory-migration-path	2019-10-16 15:06:55.895952599 -0700
+++ b/drivers/base/node.c	2019-10-16 15:06:55.902952599 -0700
@@ -101,6 +101,10 @@ static const struct attribute_group *nod
 	NULL,
 };
 
+#define TERMINAL_NODE -1
+static int node_migration[MAX_NUMNODES] = {[0 ...  MAX_NUMNODES - 1] = TERMINAL_NODE};
+static DEFINE_SPINLOCK(node_migration_lock);
+
 static void node_remove_accesses(struct node *node)
 {
 	struct node_access_nodes *c, *cnext;
@@ -530,6 +534,74 @@ static ssize_t node_read_distance(struct
 }
 static DEVICE_ATTR(distance, S_IRUGO, node_read_distance, NULL);
 
+static ssize_t migration_path_show(struct device *dev,
+				   struct device_attribute *attr,
+				   char *buf)
+{
+	return sprintf(buf, "%d\n", node_migration[dev->id]);
+}
+
+static ssize_t migration_path_store(struct device *dev,
+				    struct device_attribute *attr,
+				    const char *buf, size_t count)
+{
+	int i, err, nid = dev->id;
+	nodemask_t visited = NODE_MASK_NONE;
+	long next;
+
+	err = kstrtol(buf, 0, &next);
+	if (err)
+		return -EINVAL;
+
+	if (next < 0) {
+		spin_lock(&node_migration_lock);
+		WRITE_ONCE(node_migration[nid], TERMINAL_NODE);
+		spin_unlock(&node_migration_lock);
+		return count;
+	}
+	if (next >= MAX_NUMNODES || !node_online(next))
+		return -EINVAL;
+
+	/*
+	 * Follow the entire migration path from 'nid' through the point where
+	 * we hit a TERMINAL_NODE.
+	 *
+	 * Don't allow loops (migration cycles) in the path.
+	 */
+	node_set(nid, visited);
+	spin_lock(&node_migration_lock);
+	for (i = next; node_migration[i] != TERMINAL_NODE;
+	     i = node_migration[i]) {
+		/* Fail if we have visited this node already */
+		if (node_test_and_set(i, visited)) {
+			spin_unlock(&node_migration_lock);
+			return -EINVAL;
+		}
+	}
+	WRITE_ONCE(node_migration[nid], next);
+	spin_unlock(&node_migration_lock);
+
+	return count;
+}
+static DEVICE_ATTR_RW(migration_path);
+
+/**
+ * next_migration_node() - Get the next node in the migration path
+ * @current_node: The starting node to lookup the next node
+ *
+ * @returns: node id for next memory node in the migration path hierarchy from
+ * 	     @current_node; -1 if @current_node is terminal or its migration
+ * 	     node is not online.
+ */
+int next_migration_node(int current_node)
+{
+	int nid = READ_ONCE(node_migration[current_node]);
+
+	if (nid >= 0 && node_online(nid))
+		return nid;
+	return TERMINAL_NODE;
+}
+
 static struct attribute *node_dev_attrs[] = {
 	&dev_attr_cpumap.attr,
 	&dev_attr_cpulist.attr,
@@ -537,6 +609,7 @@ static struct attribute *node_dev_attrs[
 	&dev_attr_numastat.attr,
 	&dev_attr_distance.attr,
 	&dev_attr_vmstat.attr,
+	&dev_attr_migration_path.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(node_dev);
diff -puN include/linux/node.h~0003-node-Define-and-export-memory-migration-path include/linux/node.h
--- a/include/linux/node.h~0003-node-Define-and-export-memory-migration-path	2019-10-16 15:06:55.898952599 -0700
+++ b/include/linux/node.h	2019-10-16 15:06:55.902952599 -0700
@@ -134,6 +134,7 @@ static inline int register_one_node(int
 	return error;
 }
 
+extern int next_migration_node(int current_node);
 extern void unregister_one_node(int nid);
 extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
@@ -186,6 +187,11 @@ static inline void register_hugetlbfs_wi
 						node_registration_func_t unreg)
 {
 }
+
+static inline int next_migration_node(int current_node)
+{
+	return -1;
+}
 #endif
 
 #define to_node(device) container_of(device, struct node, dev)
_

* [PATCH 2/4] mm/migrate: Defer allocating new page until needed
  2019-10-16 22:11 [PATCH 0/4] [RFC] Migrate Pages in lieu of discard Dave Hansen
  2019-10-16 22:11 ` [PATCH 1/4] node: Define and export memory migration path Dave Hansen
@ 2019-10-16 22:11 ` Dave Hansen
  2019-10-17 11:27   ` Kirill A. Shutemov
  2019-10-16 22:11 ` [PATCH 3/4] mm/vmscan: Attempt to migrate page in lieu of discard Dave Hansen
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Dave Hansen @ 2019-10-16 22:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, dan.j.williams, Dave Hansen, keith.busch


From: Keith Busch <keith.busch@intel.com>

Migrating pages had been allocating the new page before it was actually
needed. Subsequent operations may still fail, forcing the error paths to
clean up a newly allocated page that was never used.

Defer allocating the page until we are actually ready to make use of
it, after locking the original page. This simplifies error handling,
but should not have any functional change in behavior. This is just
refactoring page migration so the main part can more easily be reused
by other code.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/mm/migrate.c |  154 ++++++++++++++++++++++++++++-----------------------------
 1 file changed, 76 insertions(+), 78 deletions(-)

diff -puN mm/migrate.c~0004-mm-migrate-Defer-allocating-new-page-until-needed mm/migrate.c
--- a/mm/migrate.c~0004-mm-migrate-Defer-allocating-new-page-until-needed	2019-10-16 15:06:57.032952596 -0700
+++ b/mm/migrate.c	2019-10-16 15:06:57.037952596 -0700
@@ -1005,56 +1005,17 @@ out:
 	return rc;
 }
 
-static int __unmap_and_move(struct page *page, struct page *newpage,
-				int force, enum migrate_mode mode)
+static int __unmap_and_move(new_page_t get_new_page,
+			    free_page_t put_new_page,
+			    unsigned long private, struct page *page,
+			    enum migrate_mode mode,
+			    enum migrate_reason reason)
 {
 	int rc = -EAGAIN;
 	int page_was_mapped = 0;
 	struct anon_vma *anon_vma = NULL;
 	bool is_lru = !__PageMovable(page);
-
-	if (!trylock_page(page)) {
-		if (!force || mode == MIGRATE_ASYNC)
-			goto out;
-
-		/*
-		 * It's not safe for direct compaction to call lock_page.
-		 * For example, during page readahead pages are added locked
-		 * to the LRU. Later, when the IO completes the pages are
-		 * marked uptodate and unlocked. However, the queueing
-		 * could be merging multiple pages for one bio (e.g.
-		 * mpage_readpages). If an allocation happens for the
-		 * second or third page, the process can end up locking
-		 * the same page twice and deadlocking. Rather than
-		 * trying to be clever about what pages can be locked,
-		 * avoid the use of lock_page for direct compaction
-		 * altogether.
-		 */
-		if (current->flags & PF_MEMALLOC)
-			goto out;
-
-		lock_page(page);
-	}
-
-	if (PageWriteback(page)) {
-		/*
-		 * Only in the case of a full synchronous migration is it
-		 * necessary to wait for PageWriteback. In the async case,
-		 * the retry loop is too short and in the sync-light case,
-		 * the overhead of stalling is too much
-		 */
-		switch (mode) {
-		case MIGRATE_SYNC:
-		case MIGRATE_SYNC_NO_COPY:
-			break;
-		default:
-			rc = -EBUSY;
-			goto out_unlock;
-		}
-		if (!force)
-			goto out_unlock;
-		wait_on_page_writeback(page);
-	}
+	struct page *newpage;
 
 	/*
 	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
@@ -1073,6 +1034,12 @@ static int __unmap_and_move(struct page
 	if (PageAnon(page) && !PageKsm(page))
 		anon_vma = page_get_anon_vma(page);
 
+	newpage = get_new_page(page, private);
+	if (!newpage) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
 	/*
 	 * Block others from accessing the new page when we get around to
 	 * establishing additional references. We are usually the only one
@@ -1082,11 +1049,11 @@ static int __unmap_and_move(struct page
 	 * This is much like races on refcount of oldpage: just don't BUG().
 	 */
 	if (unlikely(!trylock_page(newpage)))
-		goto out_unlock;
+		goto out_put;
 
 	if (unlikely(!is_lru)) {
 		rc = move_to_new_page(newpage, page, mode);
-		goto out_unlock_both;
+		goto out_unlock;
 	}
 
 	/*
@@ -1105,7 +1072,7 @@ static int __unmap_and_move(struct page
 		VM_BUG_ON_PAGE(PageAnon(page), page);
 		if (page_has_private(page)) {
 			try_to_free_buffers(page);
-			goto out_unlock_both;
+			goto out_unlock;
 		}
 	} else if (page_mapped(page)) {
 		/* Establish migration ptes */
@@ -1122,15 +1089,9 @@ static int __unmap_and_move(struct page
 	if (page_was_mapped)
 		remove_migration_ptes(page,
 			rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);
-
-out_unlock_both:
-	unlock_page(newpage);
 out_unlock:
-	/* Drop an anon_vma reference if we took one */
-	if (anon_vma)
-		put_anon_vma(anon_vma);
-	unlock_page(page);
-out:
+	unlock_page(newpage);
+out_put:
 	/*
 	 * If migration is successful, decrease refcount of the newpage
 	 * which will not free the page because new page owner increased
@@ -1141,12 +1102,20 @@ out:
 	 * state.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
+		set_page_owner_migrate_reason(newpage, reason);
 		if (unlikely(!is_lru))
 			put_page(newpage);
 		else
 			putback_lru_page(newpage);
+	} else if (put_new_page) {
+		put_new_page(newpage, private);
+	} else {
+		put_page(newpage);
 	}
-
+out:
+	/* Drop an anon_vma reference if we took one */
+	if (anon_vma)
+		put_anon_vma(anon_vma);
 	return rc;
 }
 
@@ -1171,16 +1140,11 @@ static ICE_noinline int unmap_and_move(n
 				   int force, enum migrate_mode mode,
 				   enum migrate_reason reason)
 {
-	int rc = MIGRATEPAGE_SUCCESS;
-	struct page *newpage;
+	int rc = -EAGAIN;
 
 	if (!thp_migration_supported() && PageTransHuge(page))
 		return -ENOMEM;
 
-	newpage = get_new_page(page, private);
-	if (!newpage)
-		return -ENOMEM;
-
 	if (page_count(page) == 1) {
 		/* page was freed from under us. So we are done. */
 		ClearPageActive(page);
@@ -1191,17 +1155,57 @@ static ICE_noinline int unmap_and_move(n
 				__ClearPageIsolated(page);
 			unlock_page(page);
 		}
-		if (put_new_page)
-			put_new_page(newpage, private);
-		else
-			put_page(newpage);
+		rc = MIGRATEPAGE_SUCCESS;
 		goto out;
 	}
 
-	rc = __unmap_and_move(page, newpage, force, mode);
-	if (rc == MIGRATEPAGE_SUCCESS)
-		set_page_owner_migrate_reason(newpage, reason);
+	if (!trylock_page(page)) {
+		if (!force || mode == MIGRATE_ASYNC)
+			return rc;
+
+		/*
+		 * It's not safe for direct compaction to call lock_page.
+		 * For example, during page readahead pages are added locked
+		 * to the LRU. Later, when the IO completes the pages are
+		 * marked uptodate and unlocked. However, the queueing
+		 * could be merging multiple pages for one bio (e.g.
+		 * mpage_readpages). If an allocation happens for the
+		 * second or third page, the process can end up locking
+		 * the same page twice and deadlocking. Rather than
+		 * trying to be clever about what pages can be locked,
+		 * avoid the use of lock_page for direct compaction
+		 * altogether.
+		 */
+		if (current->flags & PF_MEMALLOC)
+			return rc;
 
+		lock_page(page);
+	}
+
+	if (PageWriteback(page)) {
+		/*
+		 * Only in the case of a full synchronous migration is it
+		 * necessary to wait for PageWriteback. In the async case,
+		 * the retry loop is too short and in the sync-light case,
+		 * the overhead of stalling is too much
+		 */
+		switch (mode) {
+		case MIGRATE_SYNC:
+		case MIGRATE_SYNC_NO_COPY:
+			break;
+		default:
+			rc = -EBUSY;
+			goto out_unlock;
+		}
+		if (!force)
+			goto out_unlock;
+		wait_on_page_writeback(page);
+	}
+	rc = __unmap_and_move(get_new_page, put_new_page, private,
+			      page, mode, reason);
+
+out_unlock:
+	unlock_page(page);
 out:
 	if (rc != -EAGAIN) {
 		/*
@@ -1242,9 +1246,8 @@ out:
 		if (rc != -EAGAIN) {
 			if (likely(!__PageMovable(page))) {
 				putback_lru_page(page);
-				goto put_new;
+				goto done;
 			}
-
 			lock_page(page);
 			if (PageMovable(page))
 				putback_movable_page(page);
@@ -1253,13 +1256,8 @@ out:
 			unlock_page(page);
 			put_page(page);
 		}
-put_new:
-		if (put_new_page)
-			put_new_page(newpage, private);
-		else
-			put_page(newpage);
 	}
-
+done:
 	return rc;
 }
 
_

* [PATCH 3/4] mm/vmscan: Attempt to migrate page in lieu of discard
  2019-10-16 22:11 [PATCH 0/4] [RFC] Migrate Pages in lieu of discard Dave Hansen
  2019-10-16 22:11 ` [PATCH 1/4] node: Define and export memory migration path Dave Hansen
  2019-10-16 22:11 ` [PATCH 2/4] mm/migrate: Defer allocating new page until needed Dave Hansen
@ 2019-10-16 22:11 ` Dave Hansen
  2019-10-17 17:30   ` Yang Shi
  2019-10-16 22:11 ` [PATCH 4/4] mm/vmscan: Consider anonymous pages without swap Dave Hansen
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Dave Hansen @ 2019-10-16 22:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, dan.j.williams, Dave Hansen, keith.busch


From: Keith Busch <keith.busch@intel.com>

If a memory node has a preferred migration path to demote cold pages,
attempt to move those inactive pages to that migration node before
reclaiming. This will better utilize available memory, provide a faster
tier than swapping or discarding, and allow such pages to be reused
immediately without IO to retrieve the data.

Much like swap, this is an opt-in feature that requires the user to define
where to send pages when reclaiming them. When handling anonymous pages,
this will be considered before swap if enabled. Should the demotion fail
for any reason, the page reclaim will proceed as if the demotion feature
was not enabled.

Some places we would like to see this used:

  1. Persistent memory being used as a slower, cheaper DRAM replacement
  2. Remote memory-only "expansion" NUMA nodes
  3. Resolving memory imbalances where one NUMA node is seeing more
     allocation activity than another.  This helps keep more recent
     allocations closer to the CPUs on the node doing the allocating.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/include/linux/migrate.h        |    6 ++++
 b/include/trace/events/migrate.h |    3 +-
 b/mm/debug.c                     |    1 
 b/mm/migrate.c                   |   51 +++++++++++++++++++++++++++++++++++++++
 b/mm/vmscan.c                    |   27 ++++++++++++++++++++
 5 files changed, 87 insertions(+), 1 deletion(-)

diff -puN include/linux/migrate.h~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
--- a/include/linux/migrate.h~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2019-10-16 15:06:58.090952593 -0700
+++ b/include/linux/migrate.h	2019-10-16 15:06:58.103952593 -0700
@@ -25,6 +25,7 @@ enum migrate_reason {
 	MR_MEMPOLICY_MBIND,
 	MR_NUMA_MISPLACED,
 	MR_CONTIG_RANGE,
+	MR_DEMOTION,
 	MR_TYPES
 };
 
@@ -79,6 +80,7 @@ extern int migrate_huge_page_move_mappin
 extern int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page, enum migrate_mode mode,
 		int extra_count);
+extern int migrate_demote_mapping(struct page *page);
 #else
 
 static inline void putback_movable_pages(struct list_head *l) {}
@@ -105,6 +107,10 @@ static inline int migrate_huge_page_move
 	return -ENOSYS;
 }
 
+static inline int migrate_demote_mapping(struct page *page)
+{
+	return -ENOSYS;
+}
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_COMPACTION
diff -puN include/trace/events/migrate.h~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
--- a/include/trace/events/migrate.h~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2019-10-16 15:06:58.092952593 -0700
+++ b/include/trace/events/migrate.h	2019-10-16 15:06:58.103952593 -0700
@@ -20,7 +20,8 @@
 	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
 	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
 	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
-	EMe(MR_CONTIG_RANGE,	"contig_range")
+	EM( MR_CONTIG_RANGE,	"contig_range")			\
+	EMe(MR_DEMOTION,	"demotion")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff -puN mm/debug.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
--- a/mm/debug.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2019-10-16 15:06:58.094952593 -0700
+++ b/mm/debug.c	2019-10-16 15:06:58.103952593 -0700
@@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE
 	"mempolicy_mbind",
 	"numa_misplaced",
 	"cma",
+	"demotion",
 };
 
 const struct trace_print_flags pageflag_names[] = {
diff -puN mm/migrate.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
--- a/mm/migrate.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2019-10-16 15:06:58.097952593 -0700
+++ b/mm/migrate.c	2019-10-16 15:06:58.104952593 -0700
@@ -1119,6 +1119,57 @@ out:
 	return rc;
 }
 
+static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
+{
+	/*
+	 * The flags are set to allocate only on the desired node in the
+	 * migration path, and to fail fast if not immediately available. We
+	 * are already doing memory reclaim, we don't want heroic efforts to
+	 * get a page.
+	 */
+	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
+			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_MOVABLE;
+	struct page *newpage;
+
+	if (PageTransHuge(page)) {
+		mask |= __GFP_COMP;
+		newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
+		if (newpage)
+			prep_transhuge_page(newpage);
+	} else
+		newpage = alloc_pages_node(node, mask, 0);
+
+	return newpage;
+}
+
+/**
+ * migrate_demote_mapping() - Migrate this page and its mappings to its
+ *                            demotion node.
+ * @page: A locked, isolated, non-huge page that should migrate to its current
+ *        node's demotion target, if available. Since this is intended to be
+ *        called during memory reclaim, all flag options are set to fail fast.
+ *
+ * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise.
+ */
+int migrate_demote_mapping(struct page *page)
+{
+	int next_nid = next_migration_node(page_to_nid(page));
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(PageHuge(page), page);
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+
+	if (next_nid < 0)
+		return -ENOSYS;
+	if (PageTransHuge(page) && !thp_migration_supported())
+		return -ENOMEM;
+
+	/* MIGRATE_ASYNC is the most lightweight and never blocks. */
+	return __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
+				page, MIGRATE_ASYNC, MR_DEMOTION);
+}
+
+
 /*
  * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move().  Work
  * around it.
diff -puN mm/vmscan.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c
--- a/mm/vmscan.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2019-10-16 15:06:58.099952593 -0700
+++ b/mm/vmscan.c	2019-10-16 15:06:58.105952593 -0700
@@ -1262,6 +1262,33 @@ static unsigned long shrink_page_list(st
 			; /* try to reclaim the page below */
 		}
 
+		if (!PageHuge(page)) {
+			int rc = migrate_demote_mapping(page);
+
+			/*
+			 * -ENOMEM on a THP may indicate either migration is
+			 * unsupported or there was not enough contiguous
+			 * space. Split the THP into base pages and retry the
+			 * head immediately. The tail pages will be considered
+			 * individually within the current loop's page list.
+			 */
+			if (rc == -ENOMEM && PageTransHuge(page) &&
+			    !split_huge_page_to_list(page, page_list))
+				rc = migrate_demote_mapping(page);
+
+			if (rc == MIGRATEPAGE_SUCCESS) {
+				unlock_page(page);
+				if (likely(put_page_testzero(page)))
+					goto free_it;
+				/*
+				 * Speculative reference will free this page,
+				 * so leave it off the LRU.
+				 */
+				nr_reclaimed++;
+				continue;
+			}
+		}
+
 		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
_

* [PATCH 4/4] mm/vmscan: Consider anonymous pages without swap
  2019-10-16 22:11 [PATCH 0/4] [RFC] Migrate Pages in lieu of discard Dave Hansen
                   ` (2 preceding siblings ...)
  2019-10-16 22:11 ` [PATCH 3/4] mm/vmscan: Attempt to migrate page in lieu of discard Dave Hansen
@ 2019-10-16 22:11 ` Dave Hansen
  2019-10-17  3:45 ` [PATCH 0/4] [RFC] Migrate Pages in lieu of discard Shakeel Butt
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Dave Hansen @ 2019-10-16 22:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, dan.j.williams, Dave Hansen, keith.busch


From: Keith Busch <keith.busch@intel.com>

Age and reclaim anonymous pages if a migration path is available. The
node then has recourse for inactive anonymous pages beyond swap.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/include/linux/swap.h |   20 ++++++++++++++++++++
 b/mm/vmscan.c          |   10 +++++-----
 2 files changed, 25 insertions(+), 5 deletions(-)

diff -puN include/linux/swap.h~0006-mm-vmscan-Consider-anonymous-pages-without-swap include/linux/swap.h
--- a/include/linux/swap.h~0006-mm-vmscan-Consider-anonymous-pages-without-swap	2019-10-16 15:06:59.474952590 -0700
+++ b/include/linux/swap.h	2019-10-16 15:06:59.481952590 -0700
@@ -680,5 +680,25 @@ static inline bool mem_cgroup_swap_full(
 }
 #endif
 
+static inline bool reclaim_anon_pages(struct mem_cgroup *memcg,
+				      int node_id)
+{
+	/* Always age anon pages when we have swap */
+	if (memcg == NULL) {
+		if (get_nr_swap_pages() > 0)
+			return true;
+	} else {
+		if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
+			return true;
+	}
+
+	/* Also age anon pages if we can auto-migrate them */
+	if (next_migration_node(node_id) >= 0)
+		return true;
+
+	/* No way to reclaim anon pages */
+	return false;
+}
+
 #endif /* __KERNEL__*/
 #endif /* _LINUX_SWAP_H */
diff -puN mm/vmscan.c~0006-mm-vmscan-Consider-anonymous-pages-without-swap mm/vmscan.c
--- a/mm/vmscan.c~0006-mm-vmscan-Consider-anonymous-pages-without-swap	2019-10-16 15:06:59.477952590 -0700
+++ b/mm/vmscan.c	2019-10-16 15:06:59.482952590 -0700
@@ -327,7 +327,7 @@ unsigned long zone_reclaimable_pages(str
 
 	nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
 		zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
+	if (reclaim_anon_pages(NULL, zone_to_nid(zone)))
 		nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
 			zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
 
@@ -2166,7 +2166,7 @@ static bool inactive_list_is_low(struct
 	 * If we don't have swap space, anonymous page deactivation
 	 * is pointless.
 	 */
-	if (!file && !total_swap_pages)
+	if (!file && !reclaim_anon_pages(NULL, pgdat->node_id))
 		return false;
 
 	inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
@@ -2241,7 +2241,7 @@ static void get_scan_count(struct lruvec
 	enum lru_list lru;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
-	if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
+	if (!sc->may_swap || !reclaim_anon_pages(memcg, pgdat->node_id)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2604,7 +2604,7 @@ static inline bool should_continue_recla
 	 */
 	pages_for_compaction = compact_gap(sc->order);
 	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
+	if (reclaim_anon_pages(NULL, pgdat->node_id))
 		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
@@ -3289,7 +3289,7 @@ static void age_active_anon(struct pglis
 {
 	struct mem_cgroup *memcg;
 
-	if (!total_swap_pages)
+	if (!reclaim_anon_pages(NULL, pgdat->node_id))
 		return;
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
_

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-16 22:11 [PATCH 0/4] [RFC] Migrate Pages in lieu of discard Dave Hansen
                   ` (3 preceding siblings ...)
  2019-10-16 22:11 ` [PATCH 4/4] mm/vmscan: Consider anonymous pages without swap Dave Hansen
@ 2019-10-17  3:45 ` Shakeel Butt
  2019-10-17 14:26   ` Dave Hansen
  2019-10-17 16:01 ` Suleiman Souhlal
  2019-10-18  7:44 ` Michal Hocko
  6 siblings, 1 reply; 30+ messages in thread
From: Shakeel Butt @ 2019-10-17  3:45 UTC (permalink / raw)
  To: Dave Hansen; +Cc: LKML, Linux MM, Dan Williams, Jonathan Adams

On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> We're starting to see systems with more and more kinds of memory such
> as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory.
> Today, once DRAM fills up, reclaim will start and some of the DRAM
> contents will be thrown out.  Allocations will, at some point, start
> falling over to the slower persistent memory.
>
> That has two nasty properties.  First, the newer allocations can end
> up in the slower persistent memory.  Second, reclaimed data in DRAM
> are just discarded even if there are gobs of space in persistent
> memory that could be used.
>
> This set implements a solution to these problems.  At the end of the
> reclaim process in shrink_page_list() just before the last page
> refcount is dropped, the page is migrated to persistent memory instead
> of being dropped.
>
> While I've talked about a DRAM/PMEM pairing, this approach would
> function in any environment where memory tiers exist.
>
> This is not perfect.  It "strands" pages in slower memory and never
> brings them back to fast DRAM.  Other things need to be built to
> promote hot pages back to DRAM.
>
> This is part of a larger patch set.  If you want to apply these or
> play with them, I'd suggest using the tree from here.  It includes
> autonuma-based hot page promotion back to DRAM:
>
>         http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
>
> This is also all based on an upstream mechanism that allows
> persistent memory to be onlined and used as if it were volatile:
>
>         http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

The memory cgroup part of the story is missing here. Since PMEM is
treated as slow DRAM, shouldn't its usage be accounted to the
corresponding memcg's memory/memsw counters and the migration should
not happen for memcg limit reclaim? Otherwise some jobs can hog the
whole PMEM.

Also what happens when PMEM is full? Can the memory migrated to PMEM
be reclaimed (or discarded)?

Shakeel

* Re: [PATCH 1/4] node: Define and export memory migration path
  2019-10-16 22:11 ` [PATCH 1/4] node: Define and export memory migration path Dave Hansen
@ 2019-10-17 11:12   ` Kirill A. Shutemov
  2019-10-17 11:44     ` Kirill A. Shutemov
  0 siblings, 1 reply; 30+ messages in thread
From: Kirill A. Shutemov @ 2019-10-17 11:12 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, dan.j.williams, keith.busch

On Wed, Oct 16, 2019 at 03:11:49PM -0700, Dave Hansen wrote:
> 
> From: Keith Busch <keith.busch@intel.com>
> 
> Prepare for the kernel to auto-migrate pages to other memory nodes
> with a user defined node migration table. This allows creating a single
> migration target for each NUMA node to enable the kernel to do NUMA
> page migrations instead of simply reclaiming colder pages. A node
> with no target is a "terminal node", so reclaim acts normally there.
> The migration target does not fundamentally _need_ to be a single node,
> but this implementation starts there to limit complexity.
> 
> If you consider the migration path as a graph, cycles (loops) in the
> graph are disallowed.  This avoids wasting resources by constantly
> migrating (A->B, B->A, A->B ...).  The expectation is that cycles will
> never be allowed, and this rule is enforced if the user tries to make
> such a cycle.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  b/drivers/base/node.c  |   73 +++++++++++++++++++++++++++++++++++++++++++++++++
>  b/include/linux/node.h |    6 ++++
>  2 files changed, 79 insertions(+)
> 
> diff -puN drivers/base/node.c~0003-node-Define-and-export-memory-migration-path drivers/base/node.c
> --- a/drivers/base/node.c~0003-node-Define-and-export-memory-migration-path	2019-10-16 15:06:55.895952599 -0700
> +++ b/drivers/base/node.c	2019-10-16 15:06:55.902952599 -0700
> @@ -101,6 +101,10 @@ static const struct attribute_group *nod
>  	NULL,
>  };
>  
> +#define TERMINAL_NODE -1

Wouldn't we have confusion with NUMA_NO_NODE, which is also -1?

> +static int node_migration[MAX_NUMNODES] = {[0 ...  MAX_NUMNODES - 1] = TERMINAL_NODE};

This is the first time I see a range initializer in kernel code. It is a GCC
extension. Do we use it anywhere already?
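
For reference, the construct is the GNU range-designator extension, which
fills a span of array elements with one value:

	static int example_table[8] = { [0 ... 7] = -1 };	/* all eight entries set to -1 */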

Many distributions compile the kernel with NODES_SHIFT==10, which means this
array will take 4k even on a single-node machine.

Should it be dynamic?

> +static DEFINE_SPINLOCK(node_migration_lock);
> +
>  static void node_remove_accesses(struct node *node)
>  {
>  	struct node_access_nodes *c, *cnext;
> @@ -530,6 +534,74 @@ static ssize_t node_read_distance(struct
>  }
>  static DEVICE_ATTR(distance, S_IRUGO, node_read_distance, NULL);
>  
> +static ssize_t migration_path_show(struct device *dev,
> +				   struct device_attribute *attr,
> +				   char *buf)
> +{
> +	return sprintf(buf, "%d\n", node_migration[dev->id]);
> +}
> +
> +static ssize_t migration_path_store(struct device *dev,
> +				    struct device_attribute *attr,
> +				    const char *buf, size_t count)
> +{
> +	int i, err, nid = dev->id;
> +	nodemask_t visited = NODE_MASK_NONE;
> +	long next;
> +
> +	err = kstrtol(buf, 0, &next);
> +	if (err)
> +		return -EINVAL;
> +
> +	if (next < 0) {

Any negative number sets it to a terminal node? Why not limit it to -1?
We may find a use for other negative numbers later.

> +		spin_lock(&node_migration_lock);
> +		WRITE_ONCE(node_migration[nid], TERMINAL_NODE);
> +		spin_unlock(&node_migration_lock);
> +		return count;
> +	}
> +	if (next >= MAX_NUMNODES || !node_online(next))
> +		return -EINVAL;

What prevents offlining after the check?

> +	/*
> +	 * Follow the entire migration path from 'nid' through the point where
> +	 * we hit a TERMINAL_NODE.
> +	 *
> +	 * Don't allow loops (migration cycles) in the path.
> +	 */
> +	node_set(nid, visited);
> +	spin_lock(&node_migration_lock);
> +	for (i = next; node_migration[i] != TERMINAL_NODE;
> +	     i = node_migration[i]) {
> +		/* Fail if we have visited this node already */
> +		if (node_test_and_set(i, visited)) {
> +			spin_unlock(&node_migration_lock);
> +			return -EINVAL;
> +		}
> +	}
> +	WRITE_ONCE(node_migration[nid], next);
> +	spin_unlock(&node_migration_lock);
> +
> +	return count;
> +}
> +static DEVICE_ATTR_RW(migration_path);
> +
> +/**
> + * next_migration_node() - Get the next node in the migration path
> + * @current_node: The starting node to lookup the next node
> + *
> + * @returns: node id for next memory node in the migration path hierarchy from
> + * 	     @current_node; -1 if @current_node is terminal or its migration
> + * 	     node is not online.
> + */
> +int next_migration_node(int current_node)
> +{
> +	int nid = READ_ONCE(node_migration[current_node]);
> +
> +	if (nid >= 0 && node_online(nid))
> +		return nid;
> +	return TERMINAL_NODE;
> +}
> +
>  static struct attribute *node_dev_attrs[] = {
>  	&dev_attr_cpumap.attr,
>  	&dev_attr_cpulist.attr,
> @@ -537,6 +609,7 @@ static struct attribute *node_dev_attrs[
>  	&dev_attr_numastat.attr,
>  	&dev_attr_distance.attr,
>  	&dev_attr_vmstat.attr,
> +	&dev_attr_migration_path.attr,
>  	NULL
>  };
>  ATTRIBUTE_GROUPS(node_dev);
> diff -puN include/linux/node.h~0003-node-Define-and-export-memory-migration-path include/linux/node.h
> --- a/include/linux/node.h~0003-node-Define-and-export-memory-migration-path	2019-10-16 15:06:55.898952599 -0700
> +++ b/include/linux/node.h	2019-10-16 15:06:55.902952599 -0700
> @@ -134,6 +134,7 @@ static inline int register_one_node(int
>  	return error;
>  }
>  
> +extern int next_migration_node(int current_node);
>  extern void unregister_one_node(int nid);
>  extern int register_cpu_under_node(unsigned int cpu, unsigned int nid);
>  extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
> @@ -186,6 +187,11 @@ static inline void register_hugetlbfs_wi
>  						node_registration_func_t unreg)
>  {
>  }
> +
> +static inline int next_migration_node(int current_node)
> +{
> +	return -1;
> +}
>  #endif
>  
>  #define to_node(device) container_of(device, struct node, dev)
> _
> 

-- 
 Kirill A. Shutemov

* Re: [PATCH 2/4] mm/migrate: Defer allocating new page until needed
  2019-10-16 22:11 ` [PATCH 2/4] mm/migrate: Defer allocating new page until needed Dave Hansen
@ 2019-10-17 11:27   ` Kirill A. Shutemov
  0 siblings, 0 replies; 30+ messages in thread
From: Kirill A. Shutemov @ 2019-10-17 11:27 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, dan.j.williams, keith.busch

On Wed, Oct 16, 2019 at 03:11:51PM -0700, Dave Hansen wrote:
> 
> From: Keith Busch <keith.busch@intel.com>
> 
> Migrating pages had been allocating the new page before it was actually
> needed. Subsequent operations may still fail, forcing the error paths to
> clean up a newly allocated page that was never used.
> 
> Defer allocating the page until we are actually ready to make use of
> it, after locking the original page. This simplifies error handling,
> but should not have any functional change in behavior. This is just
> refactoring page migration so the main part can more easily be reused
> by other code.

Well, the functional change I see is that now we allocate the new page under
the page lock of the old page.

It *should* be fine, but it has to be called out in the commit message.

-- 
 Kirill A. Shutemov

* Re: [PATCH 1/4] node: Define and export memory migration path
  2019-10-17 11:12   ` Kirill A. Shutemov
@ 2019-10-17 11:44     ` Kirill A. Shutemov
  0 siblings, 0 replies; 30+ messages in thread
From: Kirill A. Shutemov @ 2019-10-17 11:44 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, dan.j.williams, keith.busch

On Thu, Oct 17, 2019 at 02:12:05PM +0300, Kirill A. Shutemov wrote:
> > +		spin_lock(&node_migration_lock);
> > +		WRITE_ONCE(node_migration[nid], TERMINAL_NODE);
> > +		spin_unlock(&node_migration_lock);
> > +		return count;
> > +	}
> > +	if (next >= MAX_NUMNODES || !node_online(next))
> > +		return -EINVAL;
> 
> What prevents offlining after the check?

And what is the story with memory hotplug interaction? I don't see any hooks
into memory hotplug to adjust migration path on offlining. Hm?

-- 
 Kirill A. Shutemov

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-17  3:45 ` [PATCH 0/4] [RFC] Migrate Pages in lieu of discard Shakeel Butt
@ 2019-10-17 14:26   ` Dave Hansen
  2019-10-17 16:58     ` Shakeel Butt
  2019-10-17 17:20     ` Yang Shi
  0 siblings, 2 replies; 30+ messages in thread
From: Dave Hansen @ 2019-10-17 14:26 UTC (permalink / raw)
  To: Shakeel Butt, Dave Hansen
  Cc: LKML, Linux MM, Dan Williams, Jonathan Adams, Chen, Tim C

On 10/16/19 8:45 PM, Shakeel Butt wrote:
> On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>> This set implements a solution to these problems.  At the end of the
>> reclaim process in shrink_page_list() just before the last page
>> refcount is dropped, the page is migrated to persistent memory instead
>> of being dropped.
..> The memory cgroup part of the story is missing here. Since PMEM is
> treated as slow DRAM, shouldn't its usage be accounted to the
> corresponding memcg's memory/memsw counters and the migration should
> not happen for memcg limit reclaim? Otherwise some jobs can hog the
> whole PMEM.

My expectation (and I haven't confirmed this) is that any memory use
is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
limit reclaim and global reclaim both end up doing migrations and
neither should have a net effect on the counters.

There is certainly a problem here because DRAM is a more valuable
resource vs. PMEM, and memcg accounts for them as if they were equally
valuable.  I really want to see memcg account for this cost discrepancy
at some point, but I'm not quite sure what form it would take.  Any
feedback from you heavy memcg users out there would be much appreciated.

> Also what happens when PMEM is full? Can the memory migrated to PMEM
> be reclaimed (or discarded)?

Yep.  The "migration path" can be as long as you want, but once the data
hits a "terminal node" it will stop getting migrated and normal discard
at the end of reclaim happens.

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-16 22:11 [PATCH 0/4] [RFC] Migrate Pages in lieu of discard Dave Hansen
                   ` (4 preceding siblings ...)
  2019-10-17  3:45 ` [PATCH 0/4] [RFC] Migrate Pages in lieu of discard Shakeel Butt
@ 2019-10-17 16:01 ` Suleiman Souhlal
  2019-10-17 16:32   ` Dave Hansen
  2019-10-18  7:44 ` Michal Hocko
  6 siblings, 1 reply; 30+ messages in thread
From: Suleiman Souhlal @ 2019-10-17 16:01 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Linux Kernel, linux-mm, dan.j.williams

On Thu, Oct 17, 2019 at 7:14 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> We're starting to see systems with more and more kinds of memory such
> as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory.
> Today, once DRAM fills up, reclaim will start and some of the DRAM
> contents will be thrown out.  Allocations will, at some point, start
> falling over to the slower persistent memory.
>
> That has two nasty properties.  First, the newer allocations can end
> up in the slower persistent memory.  Second, reclaimed data in DRAM
> are just discarded even if there are gobs of space in persistent
> memory that could be used.
>
> This set implements a solution to these problems.  At the end of the
> reclaim process in shrink_page_list() just before the last page
> refcount is dropped, the page is migrated to persistent memory instead
> of being dropped.
>
> While I've talked about a DRAM/PMEM pairing, this approach would
> function in any environment where memory tiers exist.
>
> This is not perfect.  It "strands" pages in slower memory and never
> brings them back to fast DRAM.  Other things need to be built to
> promote hot pages back to DRAM.
>
> This is part of a larger patch set.  If you want to apply these or
> play with them, I'd suggest using the tree from here.  It includes
> autonuma-based hot page promotion back to DRAM:
>
>         http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
>
> This is also all based on an upstream mechanism that allows
> persistent memory to be onlined and used as if it were volatile:
>
>         http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
>

We prototyped something very similar to this patch series in the past.

One problem that came up is that if you get into direct reclaim,
because persistent memory can have pretty low write throughput, you
can end up stalling users for a pretty long time while migrating
pages.

To mitigate that, we tried changing background reclaim to start
migrating much earlier (but not otherwise reclaiming); however, it
drastically increased the code complexity and could still fail to
keep up with memory pressure.

Because of that, we moved to a solution based on the proactive reclaim
of idle pages, that was presented at LSFMM earlier this year:
https://lwn.net/Articles/787611/ .

-- Suleiman

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-17 16:01 ` Suleiman Souhlal
@ 2019-10-17 16:32   ` Dave Hansen
  2019-10-17 16:39     ` Shakeel Butt
  2019-10-18  8:11     ` Suleiman Souhlal
  0 siblings, 2 replies; 30+ messages in thread
From: Dave Hansen @ 2019-10-17 16:32 UTC (permalink / raw)
  To: Suleiman Souhlal, Dave Hansen; +Cc: Linux Kernel, linux-mm, dan.j.williams

On 10/17/19 9:01 AM, Suleiman Souhlal wrote:
> One problem that came up is that if you get into direct reclaim,
> because persistent memory can have pretty low write throughput, you
> can end up stalling users for a pretty long time while migrating
> pages.

Basically, you're saying that memory load spikes turn into latency spikes?

FWIW, we have been benchmarking this sucker with benchmarks that claim
to care about latency.  In general, compared to DRAM, we do see worse
latency, but nothing catastrophic yet.  I'd be interested if you have
any workloads that act as reasonable proxies for your latency requirements.

> Because of that, we moved to a solution based on the proactive reclaim
> of idle pages, that was presented at LSFMM earlier this year:
> https://lwn.net/Articles/787611/ .

I saw the presentation.  The feedback in the room as I remember it was
that proactive reclaim essentially replaced the existing reclaim
mechanism, to which the audience was not receptive.  Have folks' opinions
changed on that, or are you looking for other solutions?


* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-17 16:32   ` Dave Hansen
@ 2019-10-17 16:39     ` Shakeel Butt
  2019-10-18  8:11     ` Suleiman Souhlal
  1 sibling, 0 replies; 30+ messages in thread
From: Shakeel Butt @ 2019-10-17 16:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Suleiman Souhlal, Dave Hansen, Linux Kernel, Linux MM, Dan Williams

On Thu, Oct 17, 2019 at 9:32 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/17/19 9:01 AM, Suleiman Souhlal wrote:
> > One problem that came up is that if you get into direct reclaim,
> > because persistent memory can have pretty low write throughput, you
> > can end up stalling users for a pretty long time while migrating
> > pages.
>
> Basically, you're saying that memory load spikes turn into latency spikes?
>
> FWIW, we have been benchmarking this sucker with benchmarks that claim
> to care about latency.  In general, compared to DRAM, we do see worse
> latency, but nothing catastrophic yet.  I'd be interested if you have
> any workloads that act as reasonable proxies for your latency requirements.
>
> > Because of that, we moved to a solution based on the proactive reclaim
> > of idle pages, that was presented at LSFMM earlier this year:
> > https://lwn.net/Articles/787611/ .
>
> I saw the presentation.  The feedback in the room as I remember it was
> that proactive reclaim essentially replaced the existing reclaim
> mechanism, to which the audience was not receptive.  Have folks opinions
> changed on that, or are you looking for other solutions?
>

I am currently working on a solution which shares the mechanisms
between regular and proactive reclaim. Interested users/admins can
set up proactive reclaim; otherwise, regular reclaim will kick in on low
memory. I will have something in one or two months and will post the
patches.

Shakeel

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-17 14:26   ` Dave Hansen
@ 2019-10-17 16:58     ` Shakeel Butt
  2019-10-17 20:51       ` Dave Hansen
  2019-10-17 17:20     ` Yang Shi
  1 sibling, 1 reply; 30+ messages in thread
From: Shakeel Butt @ 2019-10-17 16:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, LKML, Linux MM, Dan Williams, Jonathan Adams, Chen, Tim C

On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/16/19 8:45 PM, Shakeel Butt wrote:
> > On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> >> This set implements a solution to these problems.  At the end of the
> >> reclaim process in shrink_page_list() just before the last page
> >> refcount is dropped, the page is migrated to persistent memory instead
> >> of being dropped.
> ..> The memory cgroup part of the story is missing here. Since PMEM is
> > treated as slow DRAM, shouldn't its usage be accounted to the
> > corresponding memcg's memory/memsw counters and the migration should
> > not happen for memcg limit reclaim? Otherwise some jobs can hog the
> > whole PMEM.
>
> My expectation (and I haven't confirmed this) is that any memory use
> is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
> limit reclaim and global reclaim both end up doing migrations and
> neither should have a net effect on the counters.
>

Hmm, I didn't see the memcg charge being migrated on demotion. So, in
the code [patch 3], the counters are being decremented as DRAM is freed
but not incremented for PMEM.

> There is certainly a problem here because DRAM is a more valuable
> resource vs. PMEM, and memcg accounts for them as if they were equally
> valuable.  I really want to see memcg account for this cost discrepancy
> at some point, but I'm not quite sure what form it would take.  Any
> feedback from you heavy memcg users out there would be much appreciated.
>

There are two apparent use-cases for PMEM: explicit (apps moving their
pages to PMEM to reduce cost) and implicit (the admin moves cold pages to
PMEM transparently to the apps). In the implicit case, I see both DRAM
and PMEM as the same resource from the perspective of memcg limits
(i.e. the same memcg counter, something like cgroup v1's memsw). For the
explicit case, maybe separate counters make sense, like cgroup v2's
memory and swap.

> > Also what happens when PMEM is full? Can the memory migrated to PMEM
> > be reclaimed (or discarded)?
>
> Yep.  The "migration path" can be as long as you want, but once the data
> hits a "terminal node" it will stop getting migrated and normal discard
> at the end of reclaim happens.

I might have missed it, but I didn't see the migrated pages inserted
back onto the LRUs. If they are not on an LRU, the reclaimer will never
see them.

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-17 14:26   ` Dave Hansen
  2019-10-17 16:58     ` Shakeel Butt
@ 2019-10-17 17:20     ` Yang Shi
  2019-10-17 21:05       ` Dave Hansen
  2019-10-17 22:58       ` Shakeel Butt
  1 sibling, 2 replies; 30+ messages in thread
From: Yang Shi @ 2019-10-17 17:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Shakeel Butt, Dave Hansen, LKML, Linux MM, Dan Williams,
	Jonathan Adams, Chen, Tim C

On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/16/19 8:45 PM, Shakeel Butt wrote:
> > On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> >> This set implements a solution to these problems.  At the end of the
> >> reclaim process in shrink_page_list() just before the last page
> >> refcount is dropped, the page is migrated to persistent memory instead
> >> of being dropped.
> ..> The memory cgroup part of the story is missing here. Since PMEM is
> > treated as slow DRAM, shouldn't its usage be accounted to the
> > corresponding memcg's memory/memsw counters and the migration should
> > not happen for memcg limit reclaim? Otherwise some jobs can hog the
> > whole PMEM.
>
> My expectation (and I haven't confirmed this) is that any memory use
> is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
> limit reclaim and global reclaim both end up doing migrations and
> neither should have a net effect on the counters.

Yes, your expectation is correct. As long as PMEM is a NUMA node, it
is treated as regular memory by memcg. But I don't think memcg limit
reclaim should do migration, since limit reclaim is used to reduce
memory usage and migration doesn't reduce usage; it just moves memory
from one node to another.

In my implementation, I just skip migration for memcg limit reclaim,
please see: https://lore.kernel.org/linux-mm/1560468577-101178-7-git-send-email-yang.shi@linux.alibaba.com/

>
> There is certainly a problem here because DRAM is a more valuable
> resource vs. PMEM, and memcg accounts for them as if they were equally
> valuable.  I really want to see memcg account for this cost discrepancy
> at some point, but I'm not quite sure what form it would take.  Any
> feedback from you heavy memcg users out there would be much appreciated.

We did have some demands to control the ratio between DRAM and PMEM, as
I mentioned at LSF/MM. Mel Gorman did suggest making memcg account for
DRAM and PMEM separately, or something similar.

>
> > Also what happens when PMEM is full? Can the memory migrated to PMEM
> > be reclaimed (or discarded)?
>
> Yep.  The "migration path" can be as long as you want, but once the data
> hits a "terminal node" it will stop getting migrated and normal discard
> at the end of reclaim happens.

I recall I had a hallway conversation with Keith about this at LSF/MM.
We all agreed there should not be a cycle. But, IMHO, I don't think
exporting the migration path to userspace (or letting the user define
the migration path) and having multiple migration stops are good ideas
in general.

>

* Re: [PATCH 3/4] mm/vmscan: Attempt to migrate page in lieu of discard
  2019-10-16 22:11 ` [PATCH 3/4] mm/vmscan: Attempt to migrate page in lieu of discard Dave Hansen
@ 2019-10-17 17:30   ` Yang Shi
  2019-10-18 18:15     ` Dave Hansen
  0 siblings, 1 reply; 30+ messages in thread
From: Yang Shi @ 2019-10-17 17:30 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, Linux MM, Dan Williams, Keith Busch

On Wed, Oct 16, 2019 at 3:14 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>
> From: Keith Busch <keith.busch@intel.com>
>
> If a memory node has a preferred migration path to demote cold pages,
> attempt to move those inactive pages to that migration node before
> reclaiming. This will better utilize available memory, provide a faster
> tier than swapping or discarding, and allow such pages to be reused
> immediately without IO to retrieve the data.
>
> Much like swap, this is an opt-in feature that requires the user to define
> where to send pages when reclaiming them. When handling anonymous pages,
> this will be considered before swap if enabled. Should the demotion fail
> for any reason, the page reclaim will proceed as if the demotion feature
> was not enabled.
>
> Some places we would like to see this used:
>
>   1. Persistent memory being used as a slower, cheaper DRAM replacement
>   2. Remote memory-only "expansion" NUMA nodes
>   3. Resolving memory imbalances where one NUMA node is seeing more
>      allocation activity than another.  This helps keep more recent
>      allocations closer to the CPUs on the node doing the allocating.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
>
>  b/include/linux/migrate.h        |    6 ++++
>  b/include/trace/events/migrate.h |    3 +-
>  b/mm/debug.c                     |    1
>  b/mm/migrate.c                   |   51 +++++++++++++++++++++++++++++++++++++++
>  b/mm/vmscan.c                    |   27 ++++++++++++++++++++
>  5 files changed, 87 insertions(+), 1 deletion(-)
>
> diff -puN include/linux/migrate.h~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
> --- a/include/linux/migrate.h~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2019-10-16 15:06:58.090952593 -0700
> +++ b/include/linux/migrate.h   2019-10-16 15:06:58.103952593 -0700
> @@ -25,6 +25,7 @@ enum migrate_reason {
>         MR_MEMPOLICY_MBIND,
>         MR_NUMA_MISPLACED,
>         MR_CONTIG_RANGE,
> +       MR_DEMOTION,
>         MR_TYPES
>  };
>
> @@ -79,6 +80,7 @@ extern int migrate_huge_page_move_mappin
>  extern int migrate_page_move_mapping(struct address_space *mapping,
>                 struct page *newpage, struct page *page, enum migrate_mode mode,
>                 int extra_count);
> +extern int migrate_demote_mapping(struct page *page);
>  #else
>
>  static inline void putback_movable_pages(struct list_head *l) {}
> @@ -105,6 +107,10 @@ static inline int migrate_huge_page_move
>         return -ENOSYS;
>  }
>
> +static inline int migrate_demote_mapping(struct page *page)
> +{
> +       return -ENOSYS;
> +}
>  #endif /* CONFIG_MIGRATION */
>
>  #ifdef CONFIG_COMPACTION
> diff -puN include/trace/events/migrate.h~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
> --- a/include/trace/events/migrate.h~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard  2019-10-16 15:06:58.092952593 -0700
> +++ b/include/trace/events/migrate.h    2019-10-16 15:06:58.103952593 -0700
> @@ -20,7 +20,8 @@
>         EM( MR_SYSCALL,         "syscall_or_cpuset")            \
>         EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")              \
>         EM( MR_NUMA_MISPLACED,  "numa_misplaced")               \
> -       EMe(MR_CONTIG_RANGE,    "contig_range")
> +       EM( MR_CONTIG_RANGE,    "contig_range")                 \
> +       EMe(MR_DEMOTION,        "demotion")
>
>  /*
>   * First define the enums in the above macros to be exported to userspace
> diff -puN mm/debug.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
> --- a/mm/debug.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard      2019-10-16 15:06:58.094952593 -0700
> +++ b/mm/debug.c        2019-10-16 15:06:58.103952593 -0700
> @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE
>         "mempolicy_mbind",
>         "numa_misplaced",
>         "cma",
> +       "demotion",
>  };
>
>  const struct trace_print_flags pageflag_names[] = {
> diff -puN mm/migrate.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
> --- a/mm/migrate.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard    2019-10-16 15:06:58.097952593 -0700
> +++ b/mm/migrate.c      2019-10-16 15:06:58.104952593 -0700
> @@ -1119,6 +1119,57 @@ out:
>         return rc;
>  }
>
> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
> +{
> +       /*
> +        * The flags are set to allocate only on the desired node in the
> +        * migration path, and to fail fast if not immediately available. We
> +        * are already doing memory reclaim, we don't want heroic efforts to
> +        * get a page.
> +        */
> +       gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
> +                       __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_MOVABLE;
> +       struct page *newpage;
> +
> +       if (PageTransHuge(page)) {
> +               mask |= __GFP_COMP;
> +               newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
> +               if (newpage)
> +                       prep_transhuge_page(newpage);
> +       } else
> +               newpage = alloc_pages_node(node, mask, 0);
> +
> +       return newpage;
> +}
> +
> +/**
> + * migrate_demote_mapping() - Migrate this page and its mappings to its
> + *                            demotion node.
> + * @page: A locked, isolated, non-huge page that should migrate to its current
> + *        node's demotion target, if available. Since this is intended to be
> + *        called during memory reclaim, all flag options are set to fail fast.
> + *
> + * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise.
> + */
> +int migrate_demote_mapping(struct page *page)
> +{
> +       int next_nid = next_migration_node(page_to_nid(page));
> +
> +       VM_BUG_ON_PAGE(!PageLocked(page), page);
> +       VM_BUG_ON_PAGE(PageHuge(page), page);
> +       VM_BUG_ON_PAGE(PageLRU(page), page);
> +
> +       if (next_nid < 0)
> +               return -ENOSYS;
> +       if (PageTransHuge(page) && !thp_migration_supported())
> +               return -ENOMEM;
> +
> +       /* MIGRATE_ASYNC is the most light weight and never blocks.*/
> +       return __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
> +                               page, MIGRATE_ASYNC, MR_DEMOTION);
> +}
> +
> +
>  /*
>   * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move().  Work
>   * around it.
> diff -puN mm/vmscan.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c
> --- a/mm/vmscan.c~0005-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard     2019-10-16 15:06:58.099952593 -0700
> +++ b/mm/vmscan.c       2019-10-16 15:06:58.105952593 -0700
> @@ -1262,6 +1262,33 @@ static unsigned long shrink_page_list(st
>                         ; /* try to reclaim the page below */
>                 }
>
> +               if (!PageHuge(page)) {
> +                       int rc = migrate_demote_mapping(page);
> +
> +                       /*
> +                        * -ENOMEM on a THP may indicate either migration is
> +                        * unsupported or there was not enough contiguous
> +                        * space. Split the THP into base pages and retry the
> +                        * head immediately. The tail pages will be considered
> +                        * individually within the current loop's page list.
> +                        */
> +                       if (rc == -ENOMEM && PageTransHuge(page) &&
> +                           !split_huge_page_to_list(page, page_list))
> +                               rc = migrate_demote_mapping(page);

I recall that when Keith posted the patch the first time, I raised a
question about why not just migrate the THP as a whole. migrate_pages()
could handle this; if it fails, it just falls back to base pages.

Since the most optimistic gfp flags are used, it should not fall into
nested direct reclaim. migrate_pages() should just return failure and
then fall back to base pages.
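
For illustration, a minimal sketch of that migrate_pages()-based
alternative, reusing alloc_demote_node_page() and MR_DEMOTION from the
posted patches (hypothetical code, not from either series; it assumes
alloc_demote_node_page() is visible to the caller):

/*
 * Sketch only: demote a list of already-isolated pages with a single
 * migrate_pages() call.  migrate_pages() can itself split a THP and
 * retry the base pages when the huge-page allocation fails, which is
 * the fallback described above.  Returns the number of pages that
 * could not be migrated (or -errno); those stay on @demote_pages for
 * normal reclaim.
 */
static int demote_page_list(struct list_head *demote_pages, int target_nid)
{
	if (list_empty(demote_pages) || target_nid < 0)
		return 0;

	return migrate_pages(demote_pages, alloc_demote_node_page, NULL,
			     target_nid, MIGRATE_ASYNC, MR_DEMOTION);
}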

> +
> +                       if (rc == MIGRATEPAGE_SUCCESS) {
> +                               unlock_page(page);
> +                               if (likely(put_page_testzero(page)))
> +                                       goto free_it;
> +                               /*
> +                                * Speculative reference will free this page,
> +                                * so leave it off the LRU.
> +                                */
> +                               nr_reclaimed++;
> +                               continue;
> +                       }
> +               }
> +
>                 /*
>                  * Anonymous process memory has backing store?
>                  * Try to allocate it some swap space here.
> _
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-17 16:58     ` Shakeel Butt
@ 2019-10-17 20:51       ` Dave Hansen
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Hansen @ 2019-10-17 20:51 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Dave Hansen, LKML, Linux MM, Dan Williams, Jonathan Adams, Chen, Tim C

On 10/17/19 9:58 AM, Shakeel Butt wrote:
>> My expectation (and I haven't confirmed this) is that any memory use
>> is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
>> limit reclaim and global reclaim both end up doing migrations and
>> neither should have a net effect on the counters.
>>
> Hmm, I didn't see the memcg charge migration in the demotion code.
> So, in the code [patch 3] the counters are being decremented as DRAM
> is freed but not incremented for PMEM.

I had assumed that the migration code was doing this for me.  I'll go
make sure either way.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-17 17:20     ` Yang Shi
@ 2019-10-17 21:05       ` Dave Hansen
  2019-10-17 22:58       ` Shakeel Butt
  1 sibling, 0 replies; 30+ messages in thread
From: Dave Hansen @ 2019-10-17 21:05 UTC (permalink / raw)
  To: Yang Shi
  Cc: Shakeel Butt, Dave Hansen, LKML, Linux MM, Dan Williams,
	Jonathan Adams, Chen, Tim C

On 10/17/19 10:20 AM, Yang Shi wrote:
> On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> My expectation (and I haven't confirmed this) is that any memory use
>> is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
>> limit reclaim and global reclaim both end up doing migrations and
>> neither should have a net effect on the counters.
> 
> Yes, your expectation is correct. As long as PMEM is a NUMA node, it
> is treated as regular memory by memcg. But, I don't think memcg limit
> reclaim should do migration since limit reclaim is used to reduce
> memory usage, but migration doesn't reduce usage, it just moves memory
> from one node to the other.
> 
> In my implementation, I just skip migration for memcg limit reclaim,
> please see: https://lore.kernel.org/linux-mm/1560468577-101178-7-git-send-email-yang.shi@linux.alibaba.com/

Ahh, got it.  That does make sense.  I might have to steal your
implementation.
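
A rough sketch of that gating, hedged since the linked series may
implement it differently; global_reclaim() is the existing vmscan
helper that is true when sc->target_mem_cgroup is NULL:

	/*
	 * Sketch only: attempt demotion in shrink_page_list() for global
	 * (node) reclaim, but skip it for memcg limit reclaim, which must
	 * actually lower the cgroup's charge rather than move it.
	 */
	if (!PageHuge(page) && global_reclaim(sc)) {
		int rc = migrate_demote_mapping(page);

		/* ... demotion handling exactly as in patch 3/4 ... */
	}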

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-17 17:20     ` Yang Shi
  2019-10-17 21:05       ` Dave Hansen
@ 2019-10-17 22:58       ` Shakeel Butt
  2019-10-18 21:44         ` Yang Shi
  1 sibling, 1 reply; 30+ messages in thread
From: Shakeel Butt @ 2019-10-17 22:58 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dave Hansen, Dave Hansen, LKML, Linux MM, Dan Williams,
	Jonathan Adams, Chen, Tim C

On Thu, Oct 17, 2019 at 10:20 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 10/16/19 8:45 PM, Shakeel Butt wrote:
> > > On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> > >> This set implements a solution to these problems.  At the end of the
> > >> reclaim process in shrink_page_list() just before the last page
> > >> refcount is dropped, the page is migrated to persistent memory instead
> > >> of being dropped.
> > ..> The memory cgroup part of the story is missing here. Since PMEM is
> > > treated as slow DRAM, shouldn't its usage be accounted to the
> > > corresponding memcg's memory/memsw counters and the migration should
> > > not happen for memcg limit reclaim? Otherwise some jobs can hog the
> > > whole PMEM.
> >
> > My expectation (and I haven't confirmed this) is that any memory use
> > is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
> > limit reclaim and global reclaim both end up doing migrations and
> > neither should have a net effect on the counters.
>
> Yes, your expectation is correct. As long as PMEM is a NUMA node, it
> is treated as regular memory by memcg. But, I don't think memcg limit
> reclaim should do migration since limit reclaim is used to reduce
> memory usage, but migration doesn't reduce usage, it just moves memory
> from one node to the other.
>
> In my implementation, I just skip migration for memcg limit reclaim,
> please see: https://lore.kernel.org/linux-mm/1560468577-101178-7-git-send-email-yang.shi@linux.alibaba.com/
>
> >
> > There is certainly a problem here because DRAM is a more valuable
> > resource vs. PMEM, and memcg accounts for them as if they were equally
> > valuable.  I really want to see memcg account for this cost discrepancy
> > at some point, but I'm not quite sure what form it would take.  Any
> > feedback from you heavy memcg users out there would be much appreciated.
>
> We did have some demands to control the ratio between DRAM and PMEM as
> I mentioned in LSF/MM. Mel Gorman did suggest making memcg account for DRAM
> and PMEM separately, or something similar.
>

Can you please describe how you plan to use this ratio? Are
applications supposed to use this ratio, or will the admins be
adjusting it? Also, should it be dynamically updated based on the
workload, i.e. as the working set or hot pages grow we want more DRAM,
and as cold pages grow we want more PMEM? Basically I am trying to
see if we have something like smart auto-numa balancing to fulfill
your use-case.

> >
> > > Also what happens when PMEM is full? Can the memory migrated to PMEM
> > > be reclaimed (or discarded)?
> >
> > Yep.  The "migration path" can be as long as you want, but once the data
> > hits a "terminal node" it will stop getting migrated and normal discard
> > at the end of reclaim happens.
>
> I recalled I had a hallway conversation with Keith about this in
> LSF/MM. We all agree there should not be a cycle. But, IMHO, I don't
> think exporting the migration path to userspace (or letting users define
> the migration path) and having multiple migration stops are good ideas in
> general.
>
> >

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-16 22:11 [PATCH 0/4] [RFC] Migrate Pages in lieu of discard Dave Hansen
                   ` (5 preceding siblings ...)
  2019-10-17 16:01 ` Suleiman Souhlal
@ 2019-10-18  7:44 ` Michal Hocko
  2019-10-18 14:54   ` Dave Hansen
  6 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2019-10-18  7:44 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, dan.j.williams

On Wed 16-10-19 15:11:48, Dave Hansen wrote:
> We're starting to see systems with more and more kinds of memory such
> as Intel's implementation of persistent memory.
> 
> Let's say you have a system with some DRAM and some persistent memory.
> Today, once DRAM fills up, reclaim will start and some of the DRAM
> contents will be thrown out.  Allocations will, at some point, start
> falling over to the slower persistent memory.
> 
> That has two nasty properties.  First, the newer allocations can end
> up in the slower persistent memory.  Second, reclaimed data in DRAM
> are just discarded even if there are gobs of space in persistent
> memory that could be used.
> 
> This set implements a solution to these problems.  At the end of the
> reclaim process in shrink_page_list() just before the last page
> refcount is dropped, the page is migrated to persistent memory instead
> of being dropped.
> 
> While I've talked about a DRAM/PMEM pairing, this approach would
> function in any environment where memory tiers exist.
> 
> This is not perfect.  It "strands" pages in slower memory and never
> brings them back to fast DRAM.  Other things need to be built to
> promote hot pages back to DRAM.
> 
> This is part of a larger patch set.  If you want to apply these or
> play with them, I'd suggest using the tree from here.  It includes
> autonuma-based hot page promotion back to DRAM:
> 
> 	http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
> 
> This is also all based on an upstream mechanism that allows
> persistent memory to be onlined and used as if it were volatile:
> 
> 	http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

How does this compare to
http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-17 16:32   ` Dave Hansen
  2019-10-17 16:39     ` Shakeel Butt
@ 2019-10-18  8:11     ` Suleiman Souhlal
  2019-10-18 15:10       ` Dave Hansen
  1 sibling, 1 reply; 30+ messages in thread
From: Suleiman Souhlal @ 2019-10-18  8:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, Linux Kernel, linux-mm, dan.j.williams,
	Shakeel Butt, Jonathan Adams

On Fri, Oct 18, 2019 at 1:32 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/17/19 9:01 AM, Suleiman Souhlal wrote:
> > One problem that came up is that if you get into direct reclaim,
> > because persistent memory can have pretty low write throughput, you
> > can end up stalling users for a pretty long time while migrating
> > pages.
>
> Basically, you're saying that memory load spikes turn into latency spikes?

Yes, exactly.

> FWIW, we have been benchmarking this sucker with benchmarks that claim
> to care about latency.  In general, compared to DRAM, we do see worse
> latency, but nothing catastrophic yet.  I'd be interested if you have
> any workloads that act as reasonable proxies for your latency requirements.

Sorry, I don't know of any specific workloads I can share. :-(
Maybe Jonathan or Shakeel have something more.

I realize it's not very useful without giving specific examples, but
even disregarding persistent memory, we've had latency issues with
direct reclaim when using zswap. It's been such a problem that we're
conducting experiments with not doing zswap compression in direct
reclaim (but still doing it proactively).
The low write throughput of persistent memory would make this worse.

I think the case where we're most likely to run into this is when the
machine is close to an OOM situation and we end up thrashing rather than
OOM killing.

Somewhat related, I noticed that this patch series ratelimits
migrations from persistent memory to DRAM, but it might also make
sense to ratelimit migrations from DRAM to persistent memory. If all
the write bandwidth is taken by migrations, there might not be any
more available for applications accessing pages in persistent memory,
resulting in higher latency.
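
As a very rough sketch of what such a limit could look like (the
interval, burst, and placement are assumptions, not from either
series; DEFINE_RATELIMIT_STATE() and __ratelimit() are existing kernel
helpers):

/*
 * Sketch only: allow roughly 1000 demotion attempts per second so that
 * reclaim-driven migration cannot monopolize the PMEM write bandwidth.
 */
#include <linux/ratelimit.h>

static DEFINE_RATELIMIT_STATE(demote_rs, HZ, 1000);

static bool may_demote_page(void)
{
	/* returns false once the burst for this interval is used up */
	return __ratelimit(&demote_rs);
}

Callers in the demotion path would fall back to normal reclaim when
may_demote_page() returns false.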


Another issue we ran into, that I think might also apply to this patch
series, is that because kernel memory can't be allocated on persistent
memory, it's possible for all of DRAM to get filled by user memory and
have kernel allocations fail even though there is still a lot of free
persistent memory. This is easy to trigger: just start an application
that is bigger than DRAM.
To mitigate that, we introduced a new watermark for DRAM zones above
which user memory can't be allocated, to leave some space for kernel
allocations.
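
A sketch of the watermark idea (not from either patch set;
zone_page_state(), min_wmark_pages(), and __GFP_MOVABLE are existing
kernel symbols, while the user_reserve_pages value and the hook point
are assumptions):

/*
 * Sketch only: refuse user (movable) allocations from a DRAM zone once
 * free memory falls to the min watermark plus a reserve, leaving that
 * reserve for kernel allocations that cannot fall back to PMEM.
 */
static bool user_alloc_allowed(struct zone *zone, gfp_t gfp_mask,
			       unsigned long user_reserve_pages)
{
	unsigned long free = zone_page_state(zone, NR_FREE_PAGES);

	if (!(gfp_mask & __GFP_MOVABLE))
		return true;	/* kernel allocations are unaffected */

	return free > min_wmark_pages(zone) + user_reserve_pages;
}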

-- Suleiman

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-18  7:44 ` Michal Hocko
@ 2019-10-18 14:54   ` Dave Hansen
  2019-10-18 21:39     ` Yang Shi
  2019-10-22 13:49     ` Michal Hocko
  0 siblings, 2 replies; 30+ messages in thread
From: Dave Hansen @ 2019-10-18 14:54 UTC (permalink / raw)
  To: Michal Hocko, Dave Hansen; +Cc: linux-kernel, linux-mm, dan.j.williams

On 10/18/19 12:44 AM, Michal Hocko wrote:
> How does this compare to
> http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com

It's a _bit_ more tied to persistent memory and it appears a bit more
tied to two tiers rather than something arbitrarily deep.  They're pretty
similar conceptually although there are quite a few differences.

For instance, what I posted has a static mapping for the migration path.
 If node A is in reclaim, we always try to allocate pages on node B.
There are no restrictions on what those nodes can be.  In Yang Shi's
apporach, there's a dynamic search for a target migration node on each
migration that follows the normal alloc fallback path.  This ends up
making migration nodes special.
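
A condensed sketch of what that static mapping can look like (patch
1/4 defines the real table and next_migration_node(); the details
there, including the sysfs interface and cycle rejection, may differ):

/* One demotion target per node; NUMA_NO_NODE marks a terminal node. */
static int node_migration[MAX_NUMNODES] = {
	[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE,
};

int next_migration_node(int node)
{
	/* Cycles are rejected when the table is set up, so this terminates. */
	while ((node = node_migration[node]) != NUMA_NO_NODE) {
		if (node_online(node))
			return node;	/* first online hop on the path */
	}
	return NUMA_NO_NODE;	/* -1, i.e. a terminal node to the caller */
}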

There are also some different choices that are pretty arbitrary.  For
instance, when you allocate a migration target page, should you cause
memory pressure on the target?

To be honest, though, I don't see anything fatally flawed with it.  It's
probably a useful exercise to factor out the common bits from the two
sets and see what we can agree on being absolutely necessary.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-18  8:11     ` Suleiman Souhlal
@ 2019-10-18 15:10       ` Dave Hansen
  2019-10-18 15:39         ` Suleiman Souhlal
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Hansen @ 2019-10-18 15:10 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Dave Hansen, Linux Kernel, linux-mm, dan.j.williams,
	Shakeel Butt, Jonathan Adams, Mel Gorman

On 10/18/19 1:11 AM, Suleiman Souhlal wrote:
> Another issue we ran into, that I think might also apply to this patch
> series, is that because kernel memory can't be allocated on persistent
> memory, it's possible for all of DRAM to get filled by user memory and
> have kernel allocations fail even though there is still a lot of free
> persistent memory. This is easy to trigger, just start an application
> that is bigger than DRAM.

Why doesn't this happen on everyone's laptops where DRAM is contended
between userspace and kernel allocations?  Does the OOM killer trigger
fast enough to save us?

> To mitigate that, we introduced a new watermark for DRAM zones above
> which user memory can't be allocated, to leave some space for kernel
> allocations.

I'd be curious why the existing users of ZONE_MOVABLE don't have to do
this?  Are there just no users of ZONE_MOVABLE?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-18 15:10       ` Dave Hansen
@ 2019-10-18 15:39         ` Suleiman Souhlal
  0 siblings, 0 replies; 30+ messages in thread
From: Suleiman Souhlal @ 2019-10-18 15:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, Linux Kernel, linux-mm, dan.j.williams,
	Shakeel Butt, Jonathan Adams, Mel Gorman

On Sat, Oct 19, 2019 at 12:10 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/18/19 1:11 AM, Suleiman Souhlal wrote:
> > Another issue we ran into, that I think might also apply to this patch
> > series, is that because kernel memory can't be allocated on persistent
> > memory, it's possible for all of DRAM to get filled by user memory and
> > have kernel allocations fail even though there is still a lot of free
> > persistent memory. This is easy to trigger, just start an application
> > that is bigger than DRAM.
>
> Why doesn't this happen on everyone's laptops where DRAM is contended
> between userspace and kernel allocations?  Does the OOM killer trigger
> fast enough to save us?

Well in this case, there is plenty of free persistent memory on the
machine, but not any free DRAM to allocate kernel memory.
In the situation I'm describing, we end up OOMing when we, in my
opinion, shouldn't.

> > To mitigate that, we introduced a new watermark for DRAM zones above
> > which user memory can't be allocated, to leave some space for kernel
> > allocations.
>
> I'd be curious why the existing users of ZONE_MOVABLE don't have to do
> this?  Are there just no users of ZONE_MOVABLE?

That's an excellent question for which I don't currently have an answer.

I haven't had the chance to test your patch series, and it's possible
that it doesn't suffer from the issue.

-- Suleiman

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/4] mm/vmscan: Attempt to migrate page in lieu of discard
  2019-10-17 17:30   ` Yang Shi
@ 2019-10-18 18:15     ` Dave Hansen
  2019-10-18 21:02       ` Yang Shi
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Hansen @ 2019-10-18 18:15 UTC (permalink / raw)
  To: Yang Shi, Dave Hansen
  Cc: Linux Kernel Mailing List, Linux MM, Dan Williams, Keith Busch

On 10/17/19 10:30 AM, Yang Shi wrote:
>> +               if (!PageHuge(page)) {
>> +                       int rc = migrate_demote_mapping(page);
>> +
>> +                       /*
>> +                        * -ENOMEM on a THP may indicate either migration is
>> +                        * unsupported or there was not enough contiguous
>> +                        * space. Split the THP into base pages and retry the
>> +                        * head immediately. The tail pages will be considered
>> +                        * individually within the current loop's page list.
>> +                        */
>> +                       if (rc == -ENOMEM && PageTransHuge(page) &&
>> +                           !split_huge_page_to_list(page, page_list))
>> +                               rc = migrate_demote_mapping(page);
> I recalled when Keith posted the patch at the first time, I raised
> question about why not just migrating THP in a whole? The
> migrate_pages() could handle this. If it fails, it just fallbacks to
> base page.

There's a pair of migrate_demote_mapping()s in there.  I expected that
the first will migrate the whole THP, and the second, plus the split, is
only used if the whole-THP migration fails.

Am I reading it wrong?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/4] mm/vmscan: Attempt to migrate page in lieu of discard
  2019-10-18 18:15     ` Dave Hansen
@ 2019-10-18 21:02       ` Yang Shi
  0 siblings, 0 replies; 30+ messages in thread
From: Yang Shi @ 2019-10-18 21:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, Linux Kernel Mailing List, Linux MM, Dan Williams,
	Keith Busch

On Fri, Oct 18, 2019 at 11:15 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/17/19 10:30 AM, Yang Shi wrote:
> >> +               if (!PageHuge(page)) {
> >> +                       int rc = migrate_demote_mapping(page);
> >> +
> >> +                       /*
> >> +                        * -ENOMEM on a THP may indicate either migration is
> >> +                        * unsupported or there was not enough contiguous
> >> +                        * space. Split the THP into base pages and retry the
> >> +                        * head immediately. The tail pages will be considered
> >> +                        * individually within the current loop's page list.
> >> +                        */
> >> +                       if (rc == -ENOMEM && PageTransHuge(page) &&
> >> +                           !split_huge_page_to_list(page, page_list))
> >> +                               rc = migrate_demote_mapping(page);
> > I recalled when Keith posted the patch at the first time, I raised
> > question about why not just migrating THP in a whole? The
> > migrate_pages() could handle this. If it fails, it just fallbacks to
> > base page.
>
> There's a pair of migrate_demote_mapping()s in there.  I expected that
> the first will migrate the whole THP, and the second, plus the split, is
> only used if the whole-THP migration fails.

Ah, that's right. I mis-read the first migrate_demote_mapping(). But
why call migrate_demote_mapping() twice, for the THP and then the base
pages (if the THP fails), instead of calling migrate_pages(), which does
deal with all the cases?

>
> Am I reading it wrong?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-18 14:54   ` Dave Hansen
@ 2019-10-18 21:39     ` Yang Shi
  2019-10-18 21:55       ` Dan Williams
  2019-10-22 13:49     ` Michal Hocko
  1 sibling, 1 reply; 30+ messages in thread
From: Yang Shi @ 2019-10-18 21:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Michal Hocko, Dave Hansen, Linux Kernel Mailing List, Linux MM,
	Dan Williams

On Fri, Oct 18, 2019 at 7:54 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/18/19 12:44 AM, Michal Hocko wrote:
> > How does this compare to
> > http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com
>
> It's a _bit_ more tied to persistent memory and it appears a bit more
> tied to two tiers rather than something arbitrarily deep.  They're pretty
> similar conceptually although there are quite a few differences.

My patches do assume two tiers for now, but it is not hard to extend them
to multiple tiers. Since it is an RFC, I didn't make it that
complicated.

However, IMHO I really don't think supporting multiple tiers by making
the migration path configurable to admins or users is a good choice.
Memory migration caused by compaction or reclaim (not via syscall)
should be transparent to the users; it is kernel-internal
activity. It shouldn't be exposed to the end users.

Personally, I prefer that the firmware or the OS build the migration path.

>
> For instance, what I posted has a static mapping for the migration path.
>  If node A is in reclaim, we always try to allocate pages on node B.
> There are no restrictions on what those nodes can be.  In Yang Shi's
> apporach, there's a dynamic search for a target migration node on each
> migration that follows the normal alloc fallback path.  This ends up
> making migration nodes special.

The reason that I didn't pursue a static mapping is that nodes might
be offlined or onlined, so you have to keep the mapping right every
time the node state changes. A dynamic search just returns the
closest migration target node no matter what the topology is. It
should not be time consuming.
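
A rough sketch of that dynamic lookup, using CPU-less memory nodes as
a stand-in for the lower tier (illustrative only; the linked series
has its own definition of migration target nodes):

/*
 * Sketch only: return the closest online node that has memory but no
 * CPUs (e.g. a PMEM-only node), or NUMA_NO_NODE if none exists.
 */
static int closest_demotion_node(int from)
{
	int nid, best_nid = NUMA_NO_NODE;
	int best_dist = INT_MAX;

	for_each_node_state(nid, N_MEMORY) {
		int dist;

		if (nid == from || node_state(nid, N_CPU))
			continue;

		dist = node_distance(from, nid);
		if (dist < best_dist) {
			best_dist = dist;
			best_nid = nid;
		}
	}
	return best_nid;
}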

Actually, my patches don't require the migration target node to
be PMEM; it could be any memory tier slower than DRAM, but it just happens
that PMEM is the only available media. My patch's commit log explains this
point. Again, I would really prefer that the firmware, HMAT, or ACPI driver
build the migration path in the kernel.

In addition, DRAM nodes are definitely excluded as migration targets
since I don't think doing such migration between DRAM nodes is a good
idea in general.

>
> There are also some different choices that are pretty arbitrary.  For
> instance, when you allocate a migration target page, should you cause
> memory pressure on the target?

Yes, those are definitely arbitrary. We do need to sort out a lot of
details in the future by figuring out how real-life workloads behave.

>
> To be honest, though, I don't see anything fatally flawed with it.  It's
> probably a useful exercise to factor out the common bits from the two
> sets and see what we can agree on being absolutely necessary.

Sure, that definitely would help us move forward.

>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-17 22:58       ` Shakeel Butt
@ 2019-10-18 21:44         ` Yang Shi
  0 siblings, 0 replies; 30+ messages in thread
From: Yang Shi @ 2019-10-18 21:44 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Dave Hansen, Dave Hansen, LKML, Linux MM, Dan Williams,
	Jonathan Adams, Chen, Tim C

On Thu, Oct 17, 2019 at 3:58 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Oct 17, 2019 at 10:20 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > > On 10/16/19 8:45 PM, Shakeel Butt wrote:
> > > > On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> > > >> This set implements a solution to these problems.  At the end of the
> > > >> reclaim process in shrink_page_list() just before the last page
> > > >> refcount is dropped, the page is migrated to persistent memory instead
> > > >> of being dropped.
> > > ..> The memory cgroup part of the story is missing here. Since PMEM is
> > > > treated as slow DRAM, shouldn't its usage be accounted to the
> > > > corresponding memcg's memory/memsw counters and the migration should
> > > > not happen for memcg limit reclaim? Otherwise some jobs can hog the
> > > > whole PMEM.
> > >
> > > My expectation (and I haven't confirmed this) is that any memory use
> > > is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
> > > limit reclaim and global reclaim both end up doing migrations and
> > > neither should have a net effect on the counters.
> >
> > Yes, your expectation is correct. As long as PMEM is a NUMA node, it
> > is treated as regular memory by memcg. But, I don't think memcg limit
> > reclaim should do migration since limit reclaim is used to reduce
> > memory usage, but migration doesn't reduce usage, it just moves memory
> > from one node to the other.
> >
> > In my implementation, I just skip migration for memcg limit reclaim,
> > please see: https://lore.kernel.org/linux-mm/1560468577-101178-7-git-send-email-yang.shi@linux.alibaba.com/
> >
> > >
> > > There is certainly a problem here because DRAM is a more valuable
> > > resource vs. PMEM, and memcg accounts for them as if they were equally
> > > valuable.  I really want to see memcg account for this cost discrepancy
> > > at some point, but I'm not quite sure what form it would take.  Any
> > > feedback from you heavy memcg users out there would be much appreciated.
> >
> > We did have some demands to control the ratio between DRAM and PMEM as
> > I mentioned in LSF/MM. Mel Gorman did suggest making memcg account for DRAM
> > and PMEM separately, or something similar.
> >
>
> Can you please describe how you plan to use this ratio? Are
> applications supposed to use this ratio or the admins will be
> adjusting this ratio? Also should it dynamically updated based on the
> workload i.e. as the working set or hot pages grows we want more DRAM
> and as cold pages grows we want more PMEM? Basically I am trying to
> see if we have something like smart auto-numa balancing to fulfill
> your use-case.

We thought it should be controlled by admins and transparent to the
end users. The ratio is fixed, but the memory could be moved between
DRAM and PMEM dynamically as long as it doesn't exceed the ratio so
that we could keep warmer data in DRAM and colder data in PMEM.

I talked this about in LSF/MM, please check this out:
https://lwn.net/Articles/787418/

>
> > >
> > > > Also what happens when PMEM is full? Can the memory migrated to PMEM
> > > > be reclaimed (or discarded)?
> > >
> > > Yep.  The "migration path" can be as long as you want, but once the data
> > > hits a "terminal node" it will stop getting migrated and normal discard
> > > at the end of reclaim happens.
> >
> > I recalled I had a hallway conversation with Keith about this in
> > LSF/MM. We all agree there should not be a cycle. But, IMHO, I don't
> > think exporting the migration path to userspace (or letting users define
> > the migration path) and having multiple migration stops are good ideas in
> > general.
> >
> > >

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-18 21:39     ` Yang Shi
@ 2019-10-18 21:55       ` Dan Williams
  0 siblings, 0 replies; 30+ messages in thread
From: Dan Williams @ 2019-10-18 21:55 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dave Hansen, Michal Hocko, Dave Hansen,
	Linux Kernel Mailing List, Linux MM

On Fri, Oct 18, 2019 at 2:40 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Oct 18, 2019 at 7:54 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 10/18/19 12:44 AM, Michal Hocko wrote:
> > > How does this compare to
> > > http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com
> >
> > It's a _bit_ more tied to persistent memory and it appears a bit more
> > tied to two tiers rather than something arbitrarily deep.  They're pretty
> > similar conceptually although there are quite a few differences.
>
> My patches do assume two tiers for now but it is not hard to extend to
> multiple tiers. Since it is a RFC so I didn't make it that
> complicated.
>
> However, IMHO I really don't think supporting multiple tiers by making
> the migration path configurable to admins or users is a good choice.

It's an optional override, not a user requirement.

> Memory migration caused by compaction or reclaim (not via syscall)
> should be transparent to the users; it is kernel-internal
> activity. It shouldn't be exposed to the end users.
>
> Personally, I prefer that the firmware or the OS build the migration path.

The OS can't; it can only trust platform firmware to tell it the
memory properties.

The BIOS likely gets the tables right most of the time, and the OS can
assume they are correct, but when things inevitably go wrong a user
override is needed. That override is more usable as an explicit
migration path rather than requiring users to manually craft and
inject custom ACPI tables. I otherwise do not see the substance behind
this objection to a migration path override.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] [RFC] Migrate Pages in lieu of discard
  2019-10-18 14:54   ` Dave Hansen
  2019-10-18 21:39     ` Yang Shi
@ 2019-10-22 13:49     ` Michal Hocko
  1 sibling, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2019-10-22 13:49 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Dave Hansen, linux-kernel, linux-mm, dan.j.williams

On Fri 18-10-19 07:54:20, Dave Hansen wrote:
> On 10/18/19 12:44 AM, Michal Hocko wrote:
> > How does this compare to
> > http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com
> 
> It's a _bit_ more tied to persistent memory and it appears a bit more
> tied to two tiers rather than something arbitrarily deep.  They're pretty
> similar conceptually although there are quite a few differences.
> 
> For instance, what I posted has a static mapping for the migration path.
>  If node A is in reclaim, we always try to allocate pages on node B.
> There are no restrictions on what those nodes can be.  In Yang Shi's
> approach, there's a dynamic search for a target migration node on each
> migration that follows the normal alloc fallback path.  This ends up
> making migration nodes special.

As we discussed at LSFMM this year, and there seemed to be a good
consensus on that, the resulting implementation should be as pmem
neutral as possible. After all, node migration mode sounds like a
reasonable feature even without pmem. So I would be more inclined to the
normal alloc fallback path rather than a very specific and static
migration fallback path. If that turns out to be impractical then sure, let's
come up with something more specific, but I think there is quite a long
road there because we do not really have much experience with
this so far.

> There are also some different choices that are pretty arbitrary.  For
> instance, when you allocate a migration target page, should you cause
> memory pressure on the target?

Those are details to really sort out, and they require some
experimentation too.

> To be honest, though, I don't see anything fatally flawed with it.  It's
> probably a useful exercise to factor out the common bits from the two
> sets and see what we can agree on being absolutely necessary.

Makes sense. What would that be? Is there a real consensus on having the
new node_reclaim mode be the configuration mechanism? Do we want to
support generic NUMA without any PMEM in place as well, for starters?

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-10-22 13:49 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-16 22:11 [PATCH 0/4] [RFC] Migrate Pages in lieu of discard Dave Hansen
2019-10-16 22:11 ` [PATCH 1/4] node: Define and export memory migration path Dave Hansen
2019-10-17 11:12   ` Kirill A. Shutemov
2019-10-17 11:44     ` Kirill A. Shutemov
2019-10-16 22:11 ` [PATCH 2/4] mm/migrate: Defer allocating new page until needed Dave Hansen
2019-10-17 11:27   ` Kirill A. Shutemov
2019-10-16 22:11 ` [PATCH 3/4] mm/vmscan: Attempt to migrate page in lieu of discard Dave Hansen
2019-10-17 17:30   ` Yang Shi
2019-10-18 18:15     ` Dave Hansen
2019-10-18 21:02       ` Yang Shi
2019-10-16 22:11 ` [PATCH 4/4] mm/vmscan: Consider anonymous pages without swap Dave Hansen
2019-10-17  3:45 ` [PATCH 0/4] [RFC] Migrate Pages in lieu of discard Shakeel Butt
2019-10-17 14:26   ` Dave Hansen
2019-10-17 16:58     ` Shakeel Butt
2019-10-17 20:51       ` Dave Hansen
2019-10-17 17:20     ` Yang Shi
2019-10-17 21:05       ` Dave Hansen
2019-10-17 22:58       ` Shakeel Butt
2019-10-18 21:44         ` Yang Shi
2019-10-17 16:01 ` Suleiman Souhlal
2019-10-17 16:32   ` Dave Hansen
2019-10-17 16:39     ` Shakeel Butt
2019-10-18  8:11     ` Suleiman Souhlal
2019-10-18 15:10       ` Dave Hansen
2019-10-18 15:39         ` Suleiman Souhlal
2019-10-18  7:44 ` Michal Hocko
2019-10-18 14:54   ` Dave Hansen
2019-10-18 21:39     ` Yang Shi
2019-10-18 21:55       ` Dan Williams
2019-10-22 13:49     ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).