Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Dave Hansen <dave.hansen@linux.intel.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org,Dave Hansen
	<dave.hansen@linux.intel.com>,yang.shi@linux.alibaba.com,rientjes@google.com,ying.huang@intel.com,dan.j.williams@intel.com
Subject: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration
Date: Mon, 29 Jun 2020 16:45:17 -0700
Message-ID: <20200629234517.A7EC4BD3@viggo.jf.intel.com> (raw)
In-Reply-To: <20200629234503.749E5340@viggo.jf.intel.com>


From: Dave Hansen <dave.hansen@linux.intel.com>

Some method is obviously needed to enable reclaim-based migration.

Just like traditional autonuma, there will be some workloads that
will benefit like workloads with more "static" configurations where
hot pages stay hot and cold pages stay cold.  If pages come and go
from the hot and cold sets, the benefits of this approach will be
more limited.

The benefits are truly workload-based and *not* hardware-based.
We do not believe that there is a viable threshold where certain
hardware configurations should have this mechanism enabled while
others do not.

To be conservative, earlier work defaulted to disable reclaim-
based migration and did not include a mechanism to enable it.
This propses extending the existing "zone_reclaim_mode" (now
now really node_reclaim_mode) as a method to enable it.

We are open to any alternative that allows end users to enable
this mechanism or disable it it workload harm is detected (just
like traditional autonuma).

The implementation here is pretty simple and entirely unoptimized.
On any memory hotplug events, assume that a node was added or
removed and recalculate all migration targets.  This ensures that
the node_demotion[] array is always ready to be used in case the
new reclaim mode is enabled.  This recalculation is far from
optimal, most glaringly that it does not even attempt to figure
out if nodes are actually coming or going.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---

 b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
 b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
 b/mm/vmscan.c                             |    7 +--
 3 files changed, 73 insertions(+), 4 deletions(-)

diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
+++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
@@ -941,6 +941,7 @@ This is value OR'ed together of
 1	(bit currently ignored)
 2	Zone reclaim writes dirty pages out
 4	Zone reclaim swaps pages
+8	Zone reclaim migrates pages
 =	===================================
 
 zone_reclaim_mode is disabled by default.  For file servers or workloads
@@ -965,3 +966,11 @@ of other processes running on other node
 Allowing regular swap effectively restricts allocations to the local
 node unless explicitly overridden by memory policies or cpuset
 configurations.
+
+Page migration during reclaim is intended for systems with tiered memory
+configurations.  These systems have multiple types of memory with varied
+performance characteristics instead of plain NUMA systems where the same
+kind of memory is found at varied distances.  Allowing page migration
+during reclaim enables these systems to migrate pages from fast tiers to
+slow tiers when the fast tier is under pressure.  This migration is
+performed before swap.
diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
--- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
+++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
@@ -49,6 +49,7 @@
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
 #include <linux/oom.h>
+#include <linux/memory.h>
 
 #include <asm/tlbflush.h>
 
@@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
 	 * Avoid any oddities like cycles that could occur
 	 * from changes in the topology.  This will leave
 	 * a momentary gap when migration is disabled.
+	 *
+	 * This is superfluous for memory offlining since
+	 * MEM_GOING_OFFLINE does it independently, but it
+	 * does not hurt to do it a second time.
 	 */
 	disable_all_migrate_targets();
 
@@ -3211,6 +3216,60 @@ again:
 	/* Is another pass necessary? */
 	if (!nodes_empty(next_pass))
 		goto again;
+}
 
-	put_online_mems();
+/*
+ * React to hotplug events that might online or offline
+ * NUMA nodes.
+ *
+ * This leaves migrate-on-reclaim transiently disabled
+ * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
+ * This runs whether RECLAIM_MIGRATE is enabled or not.
+ * That ensures that the user can turn RECLAIM_MIGRATE
+ * without needing to recalcuate migration targets.
+ */
+#if defined(CONFIG_MEMORY_HOTPLUG)
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+						 unsigned long action, void *arg)
+{
+	switch (action) {
+	case MEM_GOING_OFFLINE:
+		/*
+		 * Make sure there are not transient states where
+		 * an offline node is a migration target.  This
+		 * will leave migration disabled until the offline
+		 * completes and the MEM_OFFLINE case below runs.
+		 */
+		disable_all_migrate_targets();
+		break;
+	case MEM_OFFLINE:
+	case MEM_ONLINE:
+		/*
+		 * Recalculate the target nodes once the node
+		 * reaches its final state (online or offline).
+		 */
+		set_migration_target_nodes();
+		break;
+	case MEM_CANCEL_OFFLINE:
+		/*
+		 * MEM_GOING_OFFLINE disabled all the migration
+		 * targets.  Reenable them.
+		 */
+		set_migration_target_nodes();
+		break;
+	case MEM_GOING_ONLINE:
+	case MEM_CANCEL_ONLINE:
+		break;
+	}
+
+	return notifier_from_errno(0);
 }
+
+static int __init migrate_on_reclaim_init(void)
+{
+	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
+	return 0;
+}
+late_initcall(migrate_on_reclaim_init);
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
--- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
+++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
@@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
  * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
  * ABI.  New bits are OK, but existing bits can never change.
  */
-#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
-#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
+#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
+#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
+#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
 
 /*
  * Priority for NODE_RECLAIM. This determines the fraction of pages
_


  parent reply index

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-29 23:45 [RFC][PATCH 0/8] Migrate Pages in lieu of discard Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 1/8] mm/numa: node demotion data structure and lookup Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed Dave Hansen
2020-07-01  8:47   ` Greg Thelen
2020-07-01 14:46     ` Dave Hansen
2020-07-01 18:32       ` Yang Shi
2020-06-29 23:45 ` [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard Dave Hansen
2020-07-01  0:47   ` David Rientjes
2020-07-01  1:29     ` Yang Shi
2020-07-01  5:41       ` David Rientjes
2020-07-01  8:54         ` Huang, Ying
2020-07-01 18:20           ` Dave Hansen
2020-07-01 19:50             ` David Rientjes
2020-07-02  1:50               ` Huang, Ying
2020-07-01 15:15         ` Dave Hansen
2020-07-01 17:21         ` Yang Shi
2020-07-01 19:45           ` David Rientjes
2020-07-02 10:02             ` Jonathan Cameron
2020-07-01  1:40     ` Huang, Ying
2020-07-01 16:48     ` Dave Hansen
2020-07-01 19:25       ` David Rientjes
2020-07-02  5:02         ` Huang, Ying
2020-06-29 23:45 ` [RFC][PATCH 4/8] mm/vmscan: add page demotion counter Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 5/8] mm/numa: automatically generate node migration order Dave Hansen
2020-06-30  8:22   ` Huang, Ying
2020-07-01 18:23     ` Dave Hansen
2020-07-02  1:20       ` Huang, Ying
2020-06-29 23:45 ` [RFC][PATCH 6/8] mm/vmscan: Consider anonymous pages without swap Dave Hansen
2020-06-29 23:45 ` [RFC][PATCH 7/8] mm/vmscan: never demote for memcg reclaim Dave Hansen
2020-06-29 23:45 ` Dave Hansen [this message]
2020-06-30  7:23   ` [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration Huang, Ying
2020-06-30 17:50     ` Yang Shi
2020-07-01  0:48       ` Huang, Ying
2020-07-01  1:12         ` Yang Shi
2020-07-01  1:28           ` Huang, Ying
2020-07-01 16:02       ` Dave Hansen
2020-07-03  9:30   ` Huang, Ying
2020-06-30 18:36 ` [RFC][PATCH 0/8] Migrate Pages in lieu of discard Shakeel Butt
2020-06-30 18:51   ` Dave Hansen
2020-06-30 19:25     ` Shakeel Butt
2020-06-30 19:31       ` Dave Hansen
2020-07-01 14:24         ` [RFC] [PATCH " Zi Yan
2020-07-01 14:32           ` Dave Hansen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200629234517.A7EC4BD3@viggo.jf.intel.com \
    --to=dave.hansen@linux.intel.com \
    --cc=dan.j.williams@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=rientjes@google.com \
    --cc=yang.shi@linux.alibaba.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git