linux-mm.kvack.org archive mirror
* [RFC -V2 0/8] autonuma: Optimize memory placement in memory tiering system
@ 2020-02-18  8:26 Huang, Ying
  2020-02-18  8:26 ` [RFC -V2 1/8] autonuma: Add NUMA_BALANCING_MEMORY_TIERING mode Huang, Ying
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: Huang, Ying @ 2020-02-18  8:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-mm, linux-kernel, Feng Tang, Huang Ying, Andrew Morton,
	Michal Hocko, Rik van Riel, Mel Gorman, Dave Hansen,
	Dan Williams

From: Huang Ying <ying.huang@intel.com>

With the advent of various new memory types, a system may contain
multiple types of memory, e.g. DRAM and PMEM (persistent memory).
Because the different types of memory differ in performance and cost,
such a memory subsystem is called a memory tiering system.

After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
for use like normal RAM"), PMEM can be used as cost-effective
volatile memory in separate NUMA nodes.  In a typical memory tiering
system, there are CPUs, DRAM and PMEM in each physical NUMA node.
The CPUs and the DRAM are put in one logical node, while the PMEM is
put in another (fake) logical node.

To optimize the overall system performance, the hot pages should be
placed in the DRAM node.  To do that, we need to identify the hot
pages in the PMEM node and migrate them to the DRAM node via NUMA
migration.

In autonuma, there is a set of existing mechanisms to identify the
pages recently accessed by the CPUs in a node and migrate those pages
to the node.  So we can reuse these mechanisms to optimize page
placement in the memory tiering system.  This has been implemented in
this patchset.

On the other hand, the cold pages should be placed in the PMEM node.
So we also need to identify the cold pages in the DRAM node and
migrate them to the PMEM node.

The following patchset,

[PATCH 0/4] [RFC] Migrate Pages in lieu of discard
https://lore.kernel.org/linux-mm/20191016221148.F9CCD155@viggo.jf.intel.com/

implements a mechanism to demote the cold DRAM pages to the PMEM node
under memory pressure.  Based on that, the cold DRAM pages can be
demoted to the PMEM node proactively to free some memory space on the
DRAM node, so that the hot PMEM pages can be migrated there.  This
has been implemented in this patchset too.

The patchset is based on the following not-yet-merged patchset
(after being rebased to v5.5):

[PATCH 0/4] [RFC] Migrate Pages in lieu of discard
https://lore.kernel.org/linux-mm/20191016221148.F9CCD155@viggo.jf.intel.com/

This is part of a larger patch set.  If you want to apply these or
play with them, I'd suggest using the tree below:

https://github.com/hying-caritas/linux/commits/autonuma-r2

With all the above optimizations, the score of the pmbench memory
accessing benchmark with an 80:20 read/write ratio and a normal
access address distribution improves by 116% on a 2-socket Intel
server with Optane DC Persistent Memory.

Changelog:

v2:

- Addressed comments for V1.

- Rebased on v5.5.

Best Regards,
Huang, Ying



* [RFC -V2 1/8] autonuma: Add NUMA_BALANCING_MEMORY_TIERING mode
  2020-02-18  8:26 [RFC -V2 0/8] autonuma: Optimize memory placement in memory tiering system Huang, Ying
@ 2020-02-18  8:26 ` Huang, Ying
  2020-02-18  8:26 ` [RFC -V2 2/8] autonuma, memory tiering: Rate limit NUMA migration throughput Huang, Ying
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2020-02-18  8:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-mm, linux-kernel, Feng Tang, Huang Ying, Andrew Morton,
	Michal Hocko, Rik van Riel, Mel Gorman, Dave Hansen,
	Dan Williams

From: Huang Ying <ying.huang@intel.com>

With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory).
Because the performance of the different types of memory may differ,
such a memory subsystem is called a memory tiering system.

In a typical memory tiering system, there are CPUs, fast memory and
slow memory in each physical NUMA node.  The CPUs and the fast memory
are put in one logical node (called the fast memory node), while the
slow memory is put in another (fake) logical node (called the slow
memory node).  In autonuma, there is a set of mechanisms to identify
the pages recently accessed by the CPUs in a node and migrate those
pages to the node.  So the performance optimization that promotes the
hot pages in the slow memory node to the fast memory node in the
memory tiering system can be implemented based on the autonuma
framework.

But the requirements of hot page promotion in the memory tiering
system differ from those of the normal NUMA balancing in some
aspects.  E.g. for hot page promotion, we can skip scanning the
fastest memory node because there is nowhere to promote its hot pages
to.  To make autonuma work for both the normal NUMA balancing and the
memory tiering hot page promotion, we define a set of flags and make
the value of sysctl_numa_balancing_mode the bitwise OR of these
flags.  The flags are as follows:

- 0x0: NUMA_BALANCING_DISABLED
- 0x1: NUMA_BALANCING_NORMAL
- 0x2: NUMA_BALANCING_MEMORY_TIERING

NUMA_BALANCING_NORMAL enables the normal NUMA balancing across
sockets, while NUMA_BALANCING_MEMORY_TIERING enables the hot page
promotion across memory tiers.  They can be enabled individually or
together.  If all flags are cleared, autonuma is disabled completely.

The sysctl interface is extended accordingly in a backward compatible
way.
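
For illustration only, here is a minimal user-space sketch (not
kernel code) of how the OR'ed mode bits are meant to be interpreted;
the flag values mirror the definitions added below:

#include <stdio.h>

#define NUMA_BALANCING_DISABLED		0x0
#define NUMA_BALANCING_NORMAL		0x1
#define NUMA_BALANCING_MEMORY_TIERING	0x2

static void describe_mode(int mode)
{
	if (mode == NUMA_BALANCING_DISABLED) {
		printf("%d: autonuma disabled\n", mode);
		return;
	}
	/* The two features are independent bits and can be combined. */
	printf("%d:%s%s\n", mode,
	       (mode & NUMA_BALANCING_NORMAL) ? " normal balancing" : "",
	       (mode & NUMA_BALANCING_MEMORY_TIERING) ? " tiering promotion" : "");
}

int main(void)
{
	int mode;

	/* 0, 1, 2 and 3 are the values one may write to kernel.numa_balancing. */
	for (mode = 0; mode <= 3; mode++)
		describe_mode(mode);
	return 0;
}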

TODO:

- Update ABI document: Documentation/sysctl/kernel.txt

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/sched/sysctl.h | 5 +++++
 kernel/sched/core.c          | 9 +++------
 kernel/sysctl.c              | 7 ++++---
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..80dc5030c797 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -33,6 +33,11 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+#define NUMA_BALANCING_DISABLED		0x0
+#define NUMA_BALANCING_NORMAL		0x1
+#define NUMA_BALANCING_MEMORY_TIERING	0x2
+
+extern int sysctl_numa_balancing_mode;
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90e4b00ace89..2d3f456d0ef6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2723,6 +2723,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
+int sysctl_numa_balancing_mode;
 
 #ifdef CONFIG_NUMA_BALANCING
 
@@ -2738,20 +2739,16 @@ void set_numabalancing_state(bool enabled)
 int sysctl_numa_balancing(struct ctl_table *table, int write,
 			 void __user *buffer, size_t *lenp, loff_t *ppos)
 {
-	struct ctl_table t;
 	int err;
-	int state = static_branch_likely(&sched_numa_balancing);
 
 	if (write && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	t = *table;
-	t.data = &state;
-	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 	if (err < 0)
 		return err;
 	if (write)
-		set_numabalancing_state(state);
+		set_numabalancing_state(*(int *)table->data);
 	return err;
 }
 #endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 70665934d53e..3756108bb658 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -126,6 +126,7 @@ static int sixty = 60;
 
 static int __maybe_unused neg_one = -1;
 static int __maybe_unused two = 2;
+static int __maybe_unused three = 3;
 static int __maybe_unused four = 4;
 static unsigned long zero_ul;
 static unsigned long one_ul = 1;
@@ -420,12 +421,12 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.procname	= "numa_balancing",
-		.data		= NULL, /* filled in by handler */
-		.maxlen		= sizeof(unsigned int),
+		.data		= &sysctl_numa_balancing_mode,
+		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= sysctl_numa_balancing,
 		.extra1		= SYSCTL_ZERO,
-		.extra2		= SYSCTL_ONE,
+		.extra2		= &three,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
-- 
2.24.1




* [RFC -V2 2/8] autonuma, memory tiering: Rate limit NUMA migration throughput
  2020-02-18  8:26 [RFC -V2 0/8] autonuma: Optimize memory placement in memory tiering system Huang, Ying
  2020-02-18  8:26 ` [RFC -V2 1/8] autonuma: Add NUMA_BALANCING_MEMORY_TIERING mode Huang, Ying
@ 2020-02-18  8:26 ` Huang, Ying
  2020-02-18  8:57   ` Mel Gorman
  2020-02-18  8:26 ` [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM Huang, Ying
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 13+ messages in thread
From: Huang, Ying @ 2020-02-18  8:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-mm, linux-kernel, Feng Tang, Huang Ying, Andrew Morton,
	Michal Hocko, Rik van Riel, Mel Gorman, Dave Hansen,
	Dan Williams

From: Huang Ying <ying.huang@intel.com>

In autonuma memory tiering mode, the hot PMEM (persistent memory)
pages can be migrated to DRAM via autonuma.  But this incurs some
overhead too, so the workload performance may sometimes be hurt.  To
avoid disturbing the workload too much, the migration throughput
should be rate-limited.

On the other hand, in some situations, for example when some
workloads exit, many DRAM pages become free, so some pages of the
other workloads can be migrated to DRAM.  To respond quickly to such
workload changes, it's better to migrate pages faster.

To address the above 2 requirements, the following rate limit
algorithm is used:

- If there is enough free memory in the DRAM node (that is, > high
  watermark + 2 * rate limit pages), then the NUMA migration throughput
  is not rate-limited, to respond to workload changes quickly.

- Otherwise, count the number of pages that autonuma tries to migrate
  to a DRAM node; if the count exceeds the limit specified by the
  user, stop NUMA migration until the next second.

A new sysctl knob kernel.numa_balancing_rate_limit_mbps is added for
the user to specify the limit.  If its value is 0, the default value
(the high watermark) will be used.
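
As a rough illustration, a single-threaded user-space model of the
bookkeeping described above follows (an approximation, not the kernel
code, which uses a per-node vmstat counter and jiffies; the 100 MB/s
limit and 4 KB page size are assumed example values):

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define RATE_LIMIT_PAGES	(100UL << (20 - 12))	/* 100 MB/s of 4 KB pages */

static unsigned long nr_tried;		/* pages tried to migrate (monotonic) */
static unsigned long snap_tried;	/* snapshot of nr_tried, ~once per second */
static time_t snap_time;		/* when the snapshot was taken */

static bool promotion_allowed(unsigned long nr_pages)
{
	time_t now = time(NULL);

	nr_tried += nr_pages;
	if (now > snap_time) {		/* a new one-second window begins */
		snap_time = now;
		snap_tried = nr_tried;
	}
	/* Throttle once this second's budget has been used up. */
	return nr_tried - snap_tried <= RATE_LIMIT_PAGES;
}

int main(void)
{
	int i;

	for (i = 0; i < 5; i++)
		printf("try 10000 pages: %s\n",
		       promotion_allowed(10000) ? "migrate" : "throttle");
	return 0;
}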

TODO: Add ABI document for new sysctl knob.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mmzone.h       |  7 ++++
 include/linux/sched/sysctl.h |  6 ++++
 kernel/sched/fair.c          | 62 ++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c              |  8 +++++
 mm/vmstat.c                  |  3 ++
 5 files changed, 86 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dfb09106ad70..6e7a28becdc2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -249,6 +249,9 @@ enum node_stat_item {
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
+#ifdef CONFIG_NUMA_BALANCING
+	NUMA_TRY_MIGRATE,	/* pages to try to migrate via NUMA balancing */
+#endif
 	NR_VM_NODE_STAT_ITEMS
 };
 
@@ -786,6 +789,10 @@ typedef struct pglist_data {
 	struct deferred_split deferred_split_queue;
 #endif
 
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long numa_ts;
+	unsigned long numa_try;
+#endif
 	/* Fields commonly accessed by the page reclaim scanner */
 
 	/*
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 80dc5030c797..c4b27790b901 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -43,6 +43,12 @@ extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
 
+#ifdef CONFIG_NUMA_BALANCING
+extern unsigned int sysctl_numa_balancing_rate_limit;
+#else
+#define sysctl_numa_balancing_rate_limit	0
+#endif
+
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba749f579714..ef694816150b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,6 +1064,12 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/*
+ * Restrict the NUMA migration per second in MB for each target node
+ * if no enough free space in target node
+ */
+unsigned int sysctl_numa_balancing_rate_limit;
+
 struct numa_group {
 	refcount_t refcount;
 
@@ -1404,6 +1410,43 @@ static inline unsigned long group_weight(struct task_struct *p, int nid,
 	return 1000 * faults / total_faults;
 }
 
+static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+{
+	int z;
+	unsigned long rate_limit;
+
+	rate_limit = sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT);
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_watermark_ok(zone, 0,
+				      high_wmark_pages(zone) + rate_limit * 2,
+				      ZONE_MOVABLE, 0))
+			return true;
+	}
+	return false;
+}
+
+static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
+					    unsigned long rate_limit, int nr)
+{
+	unsigned long try;
+	unsigned long now = jiffies, last_ts;
+
+	mod_node_page_state(pgdat, NUMA_TRY_MIGRATE, nr);
+	try = node_page_state(pgdat, NUMA_TRY_MIGRATE);
+	last_ts = pgdat->numa_ts;
+	if (now > last_ts + HZ &&
+	    cmpxchg(&pgdat->numa_ts, last_ts, now) == last_ts)
+		pgdat->numa_try = try;
+	if (try - pgdat->numa_try > rate_limit)
+		return false;
+	return true;
+}
+
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 				int src_nid, int dst_cpu)
 {
@@ -1411,6 +1454,25 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	int dst_nid = cpu_to_node(dst_cpu);
 	int last_cpupid, this_cpupid;
 
+	/*
+	 * If memory tiering mode is enabled, will try promote pages
+	 * in slow memory node to fast memory node.
+	 */
+	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+	    next_promotion_node(src_nid) != -1) {
+		struct pglist_data *pgdat;
+		unsigned long rate_limit;
+
+		pgdat = NODE_DATA(dst_nid);
+		if (pgdat_free_space_enough(pgdat))
+			return true;
+
+		rate_limit =
+			sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT);
+		return numa_migration_check_rate_limit(pgdat, rate_limit,
+						       hpage_nr_pages(page));
+	}
+
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
 	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3756108bb658..2d19e821267a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -419,6 +419,14 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "numa_balancing_rate_limit_mbps",
+		.data		= &sysctl_numa_balancing_rate_limit,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= &sysctl_numa_balancing_mode,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d76714d2fd7c..9326512c612c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1203,6 +1203,9 @@ const char * const vmstat_text[] = {
 	"nr_dirtied",
 	"nr_written",
 	"nr_kernel_misc_reclaimable",
+#ifdef CONFIG_NUMA_BALANCING
+	"numa_try_migrate",
+#endif
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.24.1




* [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM
  2020-02-18  8:26 [RFC -V2 0/8] autonuma: Optimize memory placement in memory tiering system Huang, Ying
  2020-02-18  8:26 ` [RFC -V2 1/8] autonuma: Add NUMA_BALANCING_MEMORY_TIERING mode Huang, Ying
  2020-02-18  8:26 ` [RFC -V2 2/8] autonuma, memory tiering: Rate limit NUMA migration throughput Huang, Ying
@ 2020-02-18  8:26 ` Huang, Ying
  2020-02-18  9:09   ` Mel Gorman
  2020-02-18  8:26 ` [RFC -V2 4/8] autonuma, memory tiering: Skip to scan fastest memory Huang, Ying
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 13+ messages in thread
From: Huang, Ying @ 2020-02-18  8:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-mm, linux-kernel, Feng Tang, Huang Ying, Andrew Morton,
	Michal Hocko, Rik van Riel, Mel Gorman, Dave Hansen,
	Dan Williams

From: Huang Ying <ying.huang@intel.com>

In a memory tiering system, if the memory size of the workloads is
smaller than that of the faster memory (e.g. DRAM) nodes, all pages
of the workloads can be put in the faster memory nodes and the slower
memory (e.g. PMEM) isn't needed at all.

So in common cases, the memory size of the workload is larger than
that of the faster memory nodes.  To optimize the performance, the
hot pages should be promoted to the faster memory nodes while the
cold pages should be demoted to the slower memory nodes.  To achieve
that, we have two choices:

a. Promote the hot pages from the slower memory node to the faster
   memory node.  This will create some memory pressure in the faster
   memory node, thus triggering memory reclaiming, where the cold
   pages will be demoted to the slower memory node.

b. Demote the cold pages from the faster memory node to the slower
   memory node.  This will create some free memory space in the
   faster memory node, and the hot pages in the slower memory node
   can then be promoted to the faster memory node.

Choice "a" creates memory pressure in the faster memory node.  If the
memory pressure of the workload is high too, the combined memory
pressure may become so high that the memory allocation latency of the
workload is affected, e.g. direct reclaiming may be triggered.

Choice "b" works much better in this respect.  If the memory pressure
of the workload is high, the workload will consume the free memory
and the hot page promotion will stop earlier, because its allocation
watermark is higher than that of the normal memory allocation.

In this patch, choice "b" is implemented.  If the memory tiering NUMA
balancing mode is enabled, the node isn't the slowest node, and the
free memory size of the node is below the high watermark, the kswapd
of the node will be woken up to free some memory until the free
memory size is above the high watermark + autonuma promotion rate
limit.  If the free memory size is below the high watermark, autonuma
promotion stops working.  This avoids creating too much memory
pressure on the system.
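
For example (a standalone sketch with assumed numbers, not the patch
code), the extra free space that kswapd is asked to keep on a DRAM
node is the promotion rate limit, converted from MB to pages, on top
of the high watermark:

#include <stdio.h>

int main(void)
{
	/* Assumed example values. */
	unsigned long page_shift = 12;			/* 4 KB pages */
	unsigned long high_wmark_pages = 131072;	/* 512 MB high watermark */
	unsigned long rate_limit_mbps = 100;		/* numa_balancing_rate_limit_mbps */

	/* MB -> pages, the same "<< (20 - PAGE_SHIFT)" conversion as the patch */
	unsigned long rate_limit_pages = rate_limit_mbps << (20 - page_shift);

	/*
	 * kswapd keeps reclaiming until free pages reach this target, while
	 * promotion already stops once free pages drop below the high
	 * watermark.
	 */
	unsigned long kswapd_target = high_wmark_pages + rate_limit_pages;

	printf("promotion rate limit: %lu pages/s\n", rate_limit_pages);
	printf("kswapd balance target: %lu pages (%lu MB)\n",
	       kswapd_target, kswapd_target >> (20 - page_shift));
	return 0;
}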

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/migrate.c | 26 +++++++++++++++++---------
 mm/vmscan.c  |  7 +++++++
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 0b046759f99a..bbf16764d105 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -48,6 +48,7 @@
 #include <linux/page_owner.h>
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
+#include <linux/sched/sysctl.h>
 
 #include <asm/tlbflush.h>
 
@@ -1946,8 +1947,7 @@ COMPAT_SYSCALL_DEFINE6(move_pages, pid_t, pid, compat_ulong_t, nr_pages,
  * Returns true if this is a safe migration target node for misplaced NUMA
  * pages. Currently it only checks the watermarks which crude
  */
-static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
-				   unsigned long nr_migrate_pages)
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat, int order)
 {
 	int z;
 
@@ -1958,12 +1958,9 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 			continue;
 
 		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
-		if (!zone_watermark_ok(zone, 0,
-				       high_wmark_pages(zone) +
-				       nr_migrate_pages,
-				       ZONE_MOVABLE, 0))
-			continue;
-		return true;
+		if (zone_watermark_ok(zone, order, high_wmark_pages(zone),
+				      ZONE_MOVABLE, 0))
+			return true;
 	}
 	return false;
 }
@@ -1990,8 +1987,19 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
 
 	/* Avoid migrating to a node that is nearly full */
-	if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
+	if (!migrate_balanced_pgdat(pgdat, compound_order(page))) {
+		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) {
+			int z;
+
+			for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+				if (populated_zone(pgdat->node_zones + z))
+					break;
+			}
+			wakeup_kswapd(pgdat->node_zones + z,
+				      0, compound_order(page), ZONE_MOVABLE);
+		}
 		return 0;
+	}
 
 	if (isolate_lru_page(page))
 		return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fe90236045d5..b265868d62ef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
 
 #include <linux/swapops.h>
 #include <linux/balloon_compaction.h>
+#include <linux/sched/sysctl.h>
 
 #include "internal.h"
 
@@ -3462,8 +3463,11 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
 	unsigned long mark = -1;
+	unsigned long promote_ratelimit;
 	struct zone *zone;
 
+	promote_ratelimit = sysctl_numa_balancing_rate_limit <<
+		(20 - PAGE_SHIFT);
 	/*
 	 * Check watermarks bottom-up as lower zones are more likely to
 	 * meet watermarks.
@@ -3475,6 +3479,9 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 			continue;
 
 		mark = high_wmark_pages(zone);
+		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+		    next_migration_node(pgdat->node_id) != -1)
+			mark += promote_ratelimit;
 		if (zone_watermark_ok_safe(zone, order, mark, classzone_idx))
 			return true;
 	}
-- 
2.24.1




* [RFC -V2 4/8] autonuma, memory tiering: Skip to scan fastest memory
  2020-02-18  8:26 [RFC -V2 0/8] autonuma: Optimize memory placement in memory tiering system Huang, Ying
                   ` (2 preceding siblings ...)
  2020-02-18  8:26 ` [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM Huang, Ying
@ 2020-02-18  8:26 ` Huang, Ying
  2020-02-18  8:26 ` [RFC -V2 5/8] autonuma, memory tiering: Only promote page if accessed twice Huang, Ying
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2020-02-18  8:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-mm, linux-kernel, Feng Tang, Huang Ying, Andrew Morton,
	Michal Hocko, Rik van Riel, Mel Gorman, Dave Hansen,
	Dan Williams

From: Huang Ying <ying.huang@intel.com>

In memory tiering NUMA balancing mode, the hot pages of the workload
in the fastest memory node cannot be promoted anywhere, so it's
unnecessary to identify the hot pages in the fastest memory node by
changing their PTE mappings to PROT_NONE.  The corresponding page
faults can be avoided too.

The patch improves the score of the pmbench memory accessing
benchmark with an 80:20 read/write ratio and a normal access address
distribution by 4.6% on a 2-socket Intel server with Optane DC
Persistent Memory.  The autonuma hint faults for the DRAM node are
reduced to almost 0 in the test.

Known problem: the autonuma statistics such as per-node memory
accesses, local/remote ratio, etc. will be influenced.  Especially,
the automatic adjustment of the NUMA scanning period will not work
reasonably, so we cannot rely on it.  Fortunately, there's no CPU in
the PMEM NUMA nodes, so we will not move tasks there because of the
statistics issue.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/huge_memory.c | 30 +++++++++++++++++++++---------
 mm/mprotect.c    | 14 +++++++++++++-
 2 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a88093213674..d45de9b1ead9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -33,6 +33,7 @@
 #include <linux/oom.h>
 #include <linux/numa.h>
 #include <linux/page_owner.h>
+#include <linux/sched/sysctl.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -1967,17 +1968,28 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 #endif
 
-	/*
-	 * Avoid trapping faults against the zero page. The read-only
-	 * data is likely to be read-cached on the local CPU and
-	 * local/remote hits to the zero page are not interesting.
-	 */
-	if (prot_numa && is_huge_zero_pmd(*pmd))
-		goto unlock;
+	if (prot_numa) {
+		struct page *page;
+		/*
+		 * Avoid trapping faults against the zero page. The read-only
+		 * data is likely to be read-cached on the local CPU and
+		 * local/remote hits to the zero page are not interesting.
+		 */
+		if (is_huge_zero_pmd(*pmd))
+			goto unlock;
 
-	if (prot_numa && pmd_protnone(*pmd))
-		goto unlock;
+		if (pmd_protnone(*pmd))
+			goto unlock;
 
+		page = pmd_page(*pmd);
+		/*
+		 * Skip if normal numa balancing is disabled and no
+		 * faster memory node to promote to
+		 */
+		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
+		    next_promotion_node(page_to_nid(page)) == -1)
+			goto unlock;
+	}
 	/*
 	 * In case prot_numa, we are under down_read(mmap_sem). It's critical
 	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7a8e84f86831..7322c98284ac 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -28,6 +28,7 @@
 #include <linux/ksm.h>
 #include <linux/uaccess.h>
 #include <linux/mm_inline.h>
+#include <linux/sched/sysctl.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
@@ -79,6 +80,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			 */
 			if (prot_numa) {
 				struct page *page;
+				int nid;
 
 				/* Avoid TLB flush if possible */
 				if (pte_protnone(oldpte))
@@ -105,7 +107,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				 * Don't mess with PTEs if page is already on the node
 				 * a single-threaded process is running on.
 				 */
-				if (target_node == page_to_nid(page))
+				nid = page_to_nid(page);
+				if (target_node == nid)
+					continue;
+
+				/*
+				 * Skip scanning if normal numa
+				 * balancing is disabled and no faster
+				 * memory node to promote to
+				 */
+				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
+				    next_promotion_node(nid) == -1)
 					continue;
 			}
 
-- 
2.24.1




* [RFC -V2 5/8] autonuma, memory tiering: Only promote page if accessed twice
  2020-02-18  8:26 [RFC -V2 0/8] autonuma: Optimize memory placement in memory tiering system Huang, Ying
                   ` (3 preceding siblings ...)
  2020-02-18  8:26 ` [RFC -V2 4/8] autonuma, memory tiering: Skip to scan fastest memory Huang, Ying
@ 2020-02-18  8:26 ` Huang, Ying
  2020-02-18  8:26 ` [RFC -V2 6/8] autonuma, memory tiering: Select hotter pages to promote to fast memory node Huang, Ying
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2020-02-18  8:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-mm, linux-kernel, Feng Tang, Huang Ying, Andrew Morton,
	Michal Hocko, Rik van Riel, Mel Gorman, Dave Hansen,
	Dan Williams

From: Huang Ying <ying.huang@intel.com>

The original assumption of auto NUMA balancing is that the memory
privately or mainly accessed by the CPUs of a NUMA node (called
private memory) should fit within the memory size of the NUMA node.
So if a page is identified as private memory, it will be migrated to
the target node immediately.  Eventually all private memory will be
migrated.

But this assumption isn't true in a memory tiering system.  In a
typical memory tiering system, there are CPUs, fast memory, and slow
memory in each physical NUMA node.  The CPUs and the fast memory are
put in one logical node (called the fast memory node), while the slow
memory is put in another (fake) logical node (called the slow memory
node).  To take full advantage of the system resources, it's common
that the size of the private memory of the workload is larger than
the memory size of the fast memory node.

To resolve the issue, we will try to migrate only the hot pages in
the private memory to the fast memory node.  A private page that was
accessed at least twice in the current and the last scanning periods
will be identified as hot and migrated.  Otherwise, the page isn't
considered hot enough to be migrated.

To record whether a page was accessed in the last scanning period,
the Accessed bit of the PTE/PMD is used.  When the page tables are
scanned for autonuma, if pte_protnone(pte) is true, the page wasn't
accessed in the last scan period and the Accessed bit is cleared;
otherwise the Accessed bit is kept.  When a NUMA hint page fault
occurs, if the Accessed bit is set, the page has been accessed at
least twice in the current and the last scanning periods and will be
migrated.
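
To make the interaction concrete, here is a toy user-space model of
the scheme described above (a sketch of the intended PTE state
transitions, not the kernel code):

#include <stdbool.h>
#include <stdio.h>

/* Toy model of one PTE: only the two bits this scheme cares about. */
struct pte {
	bool protnone;	/* hint fault trap installed by the autonuma scan */
	bool young;	/* Accessed bit */
};

/* Autonuma page table scan. */
static void numa_scan(struct pte *p)
{
	if (p->protnone)
		p->young = false;	/* not accessed in the last period */
	else
		p->protnone = true;	/* arm the trap, keep Accessed */
}

/* CPU access to the page; a hint fault fires if the trap is armed. */
static void access_page(struct pte *p)
{
	if (p->protnone) {
		/* young still set => also accessed in the previous period */
		printf("hint fault: %s\n",
		       p->young ? "accessed twice, promote" : "first access, skip");
		p->protnone = false;
	}
	p->young = true;
}

int main(void)
{
	struct pte p = { false, false };

	access_page(&p);	/* period 0: page used */
	numa_scan(&p);		/* scan 1: trap armed, Accessed kept */
	access_page(&p);	/* period 1: accessed twice, promoted */

	numa_scan(&p);		/* scan 2: trap armed again */
	numa_scan(&p);		/* scan 3: no access in between, Accessed cleared */
	access_page(&p);	/* period 3: first access after idle, skipped */
	return 0;
}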

The Accessed bit of the PTE/PMD is used by page reclaiming too, so a
conflict is possible.  Consider the following situation:

a) the page is moved from the active list to the inactive list with
   the Accessed bit cleared

b) the page is accessed, so the Accessed bit is set

c) the page table is scanned by autonuma, the PTE is set to
   PROTNONE+Accessed

d) the page isn't accessed

e) the page table is scanned by autonuma again, Accessed is cleared

f) the inactive list is scanned for reclaiming, the page is wrongly
   reclaimed because the Accessed bit was cleared by autonuma

Although the page is reclaimed wrongly, it hasn't been accessed for
at least one NUMA balancing scan period, so it isn't very hot anyway.
That is, this shouldn't be a severe issue.

The patch improves the score of the pmbench memory accessing
benchmark with an 80:20 read/write ratio and a normal access address
distribution by 3.1% on a 2-socket Intel server with Optane DC
Persistent Memory.  In the test, the number of pages promoted by
autonuma is reduced by 7.2% because some pages fail the twice-access
check.

Problems:

 - how to adjust the scanning period to the hot page identification
   requirement.  E.g. if the count of page promotions is much larger
   than the free memory, we need to scan faster to identify the really
   hot pages.  But this will trigger too many page faults too.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/huge_memory.c | 17 ++++++++++++++++-
 mm/memory.c      | 28 +++++++++++++++-------------
 mm/mprotect.c    | 15 ++++++++++++++-
 3 files changed, 45 insertions(+), 15 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d45de9b1ead9..8808e50ad921 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1558,6 +1558,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	if (unlikely(!pmd_same(pmd, *vmf->pmd)))
 		goto out_unlock;
 
+	/* Only migrate if accessed twice */
+	if (!pmd_young(*vmf->pmd))
+		goto out_unlock;
+
 	/*
 	 * If there are potential migrations, wait for completion and retry
 	 * without disrupting NUMA hinting information. Do not relock and
@@ -1978,8 +1982,19 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		if (is_huge_zero_pmd(*pmd))
 			goto unlock;
 
-		if (pmd_protnone(*pmd))
+		if (pmd_protnone(*pmd)) {
+			if (!(sysctl_numa_balancing_mode &
+			      NUMA_BALANCING_MEMORY_TIERING))
+				goto unlock;
+
+			/*
+			 * PMD young bit is used to record whether the
+			 * page is accessed in last scan period
+			 */
+			if (pmd_young(*pmd))
+				set_pmd_at(mm, addr, pmd, pmd_mkold(*pmd));
 			goto unlock;
+		}
 
 		page = pmd_page(*pmd);
 		/*
diff --git a/mm/memory.c b/mm/memory.c
index 45442d9a4f52..afb4c55cb278 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3811,10 +3811,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	 */
 	vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
 	spin_lock(vmf->ptl);
-	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		goto out;
-	}
+	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
+		goto unmap_out;
 
 	/*
 	 * Make it present again, Depending on how arch implementes non
@@ -3828,17 +3826,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 
+	/* Only migrate if accessed twice */
+	if (!pte_young(old_pte))
+		goto unmap_out;
+
 	page = vm_normal_page(vma, vmf->address, pte);
-	if (!page) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	if (!page)
+		goto unmap_out;
 
 	/* TODO: handle PTE-mapped THP */
-	if (PageCompound(page)) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	if (PageCompound(page))
+		goto unmap_out;
 
 	/*
 	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
@@ -3876,10 +3874,14 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	} else
 		flags |= TNF_MIGRATE_FAIL;
 
-out:
 	if (page_nid != NUMA_NO_NODE)
 		task_numa_fault(last_cpupid, page_nid, 1, flags);
 	return 0;
+
+unmap_out:
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
+	return 0;
 }
 
 static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7322c98284ac..1948105d23d5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -83,8 +83,21 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				int nid;
 
 				/* Avoid TLB flush if possible */
-				if (pte_protnone(oldpte))
+				if (pte_protnone(oldpte)) {
+					if (!(sysctl_numa_balancing_mode &
+					      NUMA_BALANCING_MEMORY_TIERING))
+						continue;
+
+					/*
+					 * PTE young bit is used to record
+					 * whether the page is accessed in
+					 * last scan period
+					 */
+					if (pte_young(oldpte))
+						set_pte_at(vma->vm_mm, addr, pte,
+							   pte_mkold(oldpte));
 					continue;
+				}
 
 				page = vm_normal_page(vma, addr, oldpte);
 				if (!page || PageKsm(page))
-- 
2.24.1




* [RFC -V2 6/8] autonuma, memory tiering: Select hotter pages to promote to fast memory node
  2020-02-18  8:26 [RFC -V2 0/8] autonuma: Optimize memory placement in memory tiering system Huang, Ying
                   ` (4 preceding siblings ...)
  2020-02-18  8:26 ` [RFC -V2 5/8] autonuma, memory tiering: Only promote page if accessed twice Huang, Ying
@ 2020-02-18  8:26 ` Huang, Ying
  2020-02-18  8:26 ` [RFC -V2 7/8] autonuma, memory tiering: Double hot threshold for write hint page fault Huang, Ying
  2020-02-18  8:26 ` [RFC -V2 8/8] autonuma, memory tiering: Adjust hot threshold automatically Huang, Ying
  7 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2020-02-18  8:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-mm, linux-kernel, Feng Tang, Huang Ying, Zhou, Jianshi,
	Fengguang Wu, Andrew Morton, Michal Hocko, Rik van Riel,
	Mel Gorman, Dave Hansen, Dan Williams

From: Huang Ying <ying.huang@intel.com>

In a memory tiering system, to maximize the overall system
performance, the hot pages should be put in the fast memory node
while the cold pages should be put in the slow memory node.  In the
original memory tiering autonuma implementation, we try to promote
almost all recently accessed pages, and use the LRU algorithm in page
reclaiming to keep the hot pages in the fast memory node and demote
the cold pages to the slow memory node.  The problem with this
solution is that cold pages with a low access frequency may be
promoted and then demoted, wasting memory bandwidth.  And because
migration is rate-limited, the hot pages need to compete with the
cold pages for the limited migration bandwidth.

If we could select the hotter pages to promote to the fast memory node
in the first place, then the wasted migration bandwidth would be
reduced and the hot pages would be promoted more quickly.

The patch "autonuma, memory tiering: Only promote page if accessed
twice" in this series prevents the really cold pages that were not
accessed in the last scan period from being promoted.  But the scan
period could be as long as tens of seconds, so it doesn't work well
enough for selecting the hotter pages.

To identify the hotter pages, in this patch we implement a method
suggested by Jianshi and Fengguang, which is based on autonuma page
table scanning and hint page faults, as follows:

- When a range of the page table is scanned in autonuma, the timestamp
  and the address range are recorded in a ring buffer in struct
  mm_struct.  So we have information about the recent N scans.

- When an autonuma hint page fault occurs, the fault address is
  searched for in the ring buffer to get its scanning timestamp.  The
  hint page fault latency is defined as

    hint page fault timestamp - scan timestamp

  The higher the access frequency of a page, the more likely its hint
  page fault latency is short.  So the hint page fault latency is a
  good estimation of the page heat.
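
A simplified user-space model of the lookup may help; it ignores the
scan-offset wraparound and the concurrency handling that the kernel
code below has to deal with:

#include <stdio.h>
#include <time.h>

#define NR_HIST	16

/* Per-mm scan history: where each of the recent scans started and when. */
static unsigned long scan_start[NR_HIST];
static time_t scan_time[NR_HIST];
static int scan_idx;			/* next slot to fill */
static unsigned long scan_end;		/* where the latest scan stopped */

/* Called when the scanner begins a new pass at @start. */
static void record_scan(unsigned long start)
{
	scan_start[scan_idx] = start;
	scan_time[scan_idx] = time(NULL);
	scan_idx = (scan_idx + 1) % NR_HIST;
}

/*
 * At hint page fault time: walk the history from newest to oldest, find
 * the scan whose range [start, end) covers @addr, and return the elapsed
 * time since its timestamp.
 */
static double fault_latency(unsigned long addr)
{
	unsigned long end = scan_end;
	time_t now = time(NULL);
	double latency = 0;
	int i, j;

	for (j = 0; j < NR_HIST; j++) {
		i = (scan_idx - 1 - j + NR_HIST) % NR_HIST;
		latency = difftime(now, scan_time[i]);
		if (addr >= scan_start[i] && addr < end)
			return latency;
		end = scan_start[i];	/* an older scan ends where this one starts */
	}
	/* History too short: approximate with the oldest latency tracked. */
	return latency;
}

int main(void)
{
	record_scan(0x00000);	scan_end = 0x40000;	/* scan 1: [0, 256K) */
	record_scan(0x40000);	scan_end = 0x80000;	/* scan 2: [256K, 512K) */
	printf("latency for 0x20000: %.0f s\n", fault_latency(0x20000));
	printf("latency for 0x60000: %.0f s\n", fault_latency(0x60000));
	return 0;
}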

The ring buffer should be large enough to record a reasonably long
NUMA scanning history.  From task_scan_min(), the minimal interval
between task_numa_work() runs is about 100 ms by default.  So we can
keep 1600 ms of history by default if the size is set to 16.  If the
user chooses a smaller sysctl_numa_balancing_scan_size, we can only
keep a shorter history.  In general, we want to keep no less than
1000 ms of history, so 16 seems a reasonable choice.

The remaining problem is how to determine the hot threshold.  It's
not easy to do that automatically.  So we provide a sysctl knob:
kernel.numa_balancing_hot_threshold_ms.  All pages with a hint page
fault latency < the threshold will be considered hot.  The system
administrator can determine the hot threshold based on various
information, such as the PMEM bandwidth limit, the average number of
pages that pass the hot threshold, etc.  The default hot threshold is
1 second, which works well in our performance tests.

The patch improves the score of the pmbench memory accessing
benchmark with an 80:20 read/write ratio and a normal access address
distribution by 9.2% with 50.3% fewer NUMA page migrations on a
2-socket Intel server with Optane DC Persistent Memory.  That is, the
cost of autonuma page migration is reduced considerably.

The downside of the patch is that the response time to workload hot
spot changes may be much longer.  For example:

- A previously cold memory area becomes hot.

- The hint page fault will be triggered, but the hint page fault
  latency may not be shorter than the hot threshold, so the pages may
  not be promoted.

- When the memory area is scanned again, maybe after a scan period,
  the measured hint page fault latency will be shorter than the hot
  threshold and the pages will be promoted.

To mitigate this,

- If there is enough free space in the fast memory node (> high
  watermark + 2 * promotion rate limit), the hot threshold is not
  used; all pages are promoted upon the hint page fault for fast
  response.

- If fast response is more important for system performance, the
  administrator can set a higher hot threshold.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: "Zhou, Jianshi" <jianshi.zhou@intel.com>
Suggested-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>

Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mempolicy.h            |  5 +-
 include/linux/mm_types.h             |  9 ++++
 include/linux/sched/numa_balancing.h |  8 ++-
 include/linux/sched/sysctl.h         |  1 +
 kernel/sched/fair.c                  | 78 +++++++++++++++++++++++++---
 kernel/sysctl.c                      |  7 +++
 mm/huge_memory.c                     |  6 +--
 mm/memory.c                          |  7 ++-
 mm/mempolicy.c                       |  7 ++-
 9 files changed, 108 insertions(+), 20 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..674aaa7614ed 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -202,7 +202,8 @@ static inline bool vma_migratable(struct vm_area_struct *vma)
 	return true;
 }
 
-extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long,
+			  int flags);
 extern void mpol_put_task_policy(struct task_struct *);
 
 #else
@@ -300,7 +301,7 @@ static inline int mpol_parse_str(char *str, struct mempolicy **mpol)
 #endif
 
 static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
-				 unsigned long address)
+				 unsigned long address, int flags)
 {
 	return -1; /* no node preference */
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 270aa8fd2800..2fed3d92bbc1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -508,6 +508,15 @@ struct mm_struct {
 
 		/* numa_scan_seq prevents two threads setting pte_numa */
 		int numa_scan_seq;
+
+		/*
+		 * Keep 1600ms history of NUMA scanning, when default
+		 * 100ms minimal scanning interval is used.
+		 */
+#define NUMA_SCAN_NR_HIST	16
+		int numa_scan_idx;
+		unsigned long numa_scan_jiffies[NUMA_SCAN_NR_HIST];
+		unsigned long numa_scan_starts[NUMA_SCAN_NR_HIST];
 #endif
 		/*
 		 * An operation with batched TLB flushing is going on. Anything
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 3988762efe15..4899ec000245 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -14,6 +14,7 @@
 #define TNF_SHARED	0x04
 #define TNF_FAULT_LOCAL	0x08
 #define TNF_MIGRATE_FAIL 0x10
+#define TNF_YOUNG	0x20
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
@@ -21,7 +22,8 @@ extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 extern void task_numa_free(struct task_struct *p, bool final);
 extern bool should_numa_migrate_memory(struct task_struct *p, struct page *page,
-					int src_nid, int dst_cpu);
+				       int src_nid, int dst_cpu,
+				       unsigned long addr, int flags);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
@@ -38,7 +40,9 @@ static inline void task_numa_free(struct task_struct *p, bool final)
 {
 }
 static inline bool should_numa_migrate_memory(struct task_struct *p,
-				struct page *page, int src_nid, int dst_cpu)
+					      struct page *page, int src_nid,
+					      int dst_cpu, unsigned long addr,
+					      int flags)
 {
 	return true;
 }
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c4b27790b901..c207709ff498 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -42,6 +42,7 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
+extern unsigned int sysctl_numa_balancing_hot_threshold;
 
 #ifdef CONFIG_NUMA_BALANCING
 extern unsigned int sysctl_numa_balancing_rate_limit;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ef694816150b..773f3220efc4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1070,6 +1070,9 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
  */
 unsigned int sysctl_numa_balancing_rate_limit;
 
+/* The page with hint page fault latency < threshold in ms is considered hot */
+unsigned int sysctl_numa_balancing_hot_threshold = 1000;
+
 struct numa_group {
 	refcount_t refcount;
 
@@ -1430,6 +1433,43 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat)
 	return false;
 }
 
+static long numa_hint_fault_latency(struct task_struct *p, unsigned long addr)
+{
+	struct mm_struct *mm = p->mm;
+	unsigned long now = jiffies;
+	unsigned long start, end;
+	int i, j;
+	long latency = 0;
+
+	/*
+	 * Paired with smp_store_release() in task_numa_work() to check
+	 * scan range buffer after get current index
+	 */
+	i = smp_load_acquire(&mm->numa_scan_idx);
+	i = (i - 1) % NUMA_SCAN_NR_HIST;
+
+	end = READ_ONCE(mm->numa_scan_offset);
+	start = READ_ONCE(mm->numa_scan_starts[i]);
+	if (start == end)
+		end = start + MAX_SCAN_WINDOW * (1UL << 22);
+	for (j = 0; j < NUMA_SCAN_NR_HIST; j++) {
+		latency = now - READ_ONCE(mm->numa_scan_jiffies[i]);
+		start = READ_ONCE(mm->numa_scan_starts[i]);
+		/* Scan pass the end of address space */
+		if (end < start)
+			end = TASK_SIZE;
+		if (addr >= start && addr < end)
+			return latency;
+		end = start;
+		i = (i - 1) % NUMA_SCAN_NR_HIST;
+	}
+	/*
+	 * The tracking window isn't large enough, approximate to the
+	 * max latency in the tracking window.
+	 */
+	return latency;
+}
+
 static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
 					    unsigned long rate_limit, int nr)
 {
@@ -1448,7 +1488,8 @@ static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
 }
 
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
-				int src_nid, int dst_cpu)
+				int src_nid, int dst_cpu, unsigned long addr,
+				int flags)
 {
 	struct numa_group *ng = deref_curr_numa_group(p);
 	int dst_nid = cpu_to_node(dst_cpu);
@@ -1461,12 +1502,21 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
 	    next_promotion_node(src_nid) != -1) {
 		struct pglist_data *pgdat;
-		unsigned long rate_limit;
+		unsigned long rate_limit, latency, th;
 
 		pgdat = NODE_DATA(dst_nid);
 		if (pgdat_free_space_enough(pgdat))
 			return true;
 
+		/* The page hasn't been accessed in the last scan period */
+		if (!(flags & TNF_YOUNG))
+			return false;
+
+		th = msecs_to_jiffies(sysctl_numa_balancing_hot_threshold);
+		latency = numa_hint_fault_latency(p, addr);
+		if (latency > th)
+			return false;
+
 		rate_limit =
 			sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT);
 		return numa_migration_check_rate_limit(pgdat, rate_limit,
@@ -2540,7 +2590,7 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	 * expensive, to avoid any form of compiler optimizations:
 	 */
 	WRITE_ONCE(p->mm->numa_scan_seq, READ_ONCE(p->mm->numa_scan_seq) + 1);
-	p->mm->numa_scan_offset = 0;
+	WRITE_ONCE(p->mm->numa_scan_offset, 0);
 }
 
 /*
@@ -2557,6 +2607,7 @@ static void task_numa_work(struct callback_head *work)
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
 	long pages, virtpages;
+	int idx;
 
 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
 
@@ -2615,6 +2666,15 @@ static void task_numa_work(struct callback_head *work)
 		start = 0;
 		vma = mm->mmap;
 	}
+	idx = mm->numa_scan_idx;
+	WRITE_ONCE(mm->numa_scan_starts[idx], start);
+	WRITE_ONCE(mm->numa_scan_jiffies[idx], jiffies);
+	/*
+	 * Paired with smp_load_acquire() in numa_hint_fault_latency()
+	 * to update scan range buffer index after update the buffer
+	 * contents.
+	 */
+	smp_store_release(&mm->numa_scan_idx, (idx + 1) % NUMA_SCAN_NR_HIST);
 	for (; vma; vma = vma->vm_next) {
 		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
 			is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
@@ -2642,6 +2702,7 @@ static void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
+			WRITE_ONCE(mm->numa_scan_offset, end);
 			nr_pte_updates = change_prot_numa(vma, start, end);
 
 			/*
@@ -2671,9 +2732,7 @@ static void task_numa_work(struct callback_head *work)
 	 * would find the !migratable VMA on the next scan but not reset the
 	 * scanner to the start so check it now.
 	 */
-	if (vma)
-		mm->numa_scan_offset = start;
-	else
+	if (!vma)
 		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
 
@@ -2691,7 +2750,7 @@ static void task_numa_work(struct callback_head *work)
 
 void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 {
-	int mm_users = 0;
+	int i, mm_users = 0;
 	struct mm_struct *mm = p->mm;
 
 	if (mm) {
@@ -2699,6 +2758,11 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 		if (mm_users == 1) {
 			mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
 			mm->numa_scan_seq = 0;
+			mm->numa_scan_idx = 0;
+			for (i = 0; i < NUMA_SCAN_NR_HIST; i++) {
+				mm->numa_scan_jiffies[i] = 0;
+				mm->numa_scan_starts[i] = 0;
+			}
 		}
 	}
 	p->node_stamp			= 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2d19e821267a..da1fc0303cca 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -427,6 +427,13 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 	},
+	{
+		.procname	= "numa_balancing_hot_threshold_ms",
+		.data		= &sysctl_numa_balancing_hot_threshold,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= &sysctl_numa_balancing_mode,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8808e50ad921..08d25763e65f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1559,8 +1559,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 		goto out_unlock;
 
 	/* Only migrate if accessed twice */
-	if (!pmd_young(*vmf->pmd))
-		goto out_unlock;
+	if (pmd_young(*vmf->pmd))
+		flags |= TNF_YOUNG;
 
 	/*
 	 * If there are potential migrations, wait for completion and retry
@@ -1595,7 +1595,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	 * page_table_lock if at all possible
 	 */
 	page_locked = trylock_page(page);
-	target_nid = mpol_misplaced(page, vma, haddr);
+	target_nid = mpol_misplaced(page, vma, haddr, flags);
 	if (target_nid == NUMA_NO_NODE) {
 		/* If the page was locked, there are no parallel migrations */
 		if (page_locked)
diff --git a/mm/memory.c b/mm/memory.c
index afb4c55cb278..207caa9e61da 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3789,7 +3789,7 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		*flags |= TNF_FAULT_LOCAL;
 	}
 
-	return mpol_misplaced(page, vma, addr);
+	return mpol_misplaced(page, vma, addr, *flags);
 }
 
 static vm_fault_t do_numa_page(struct vm_fault *vmf)
@@ -3826,9 +3826,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 
-	/* Only migrate if accessed twice */
-	if (!pte_young(old_pte))
-		goto unmap_out;
+	if (pte_young(old_pte))
+		flags |= TNF_YOUNG;
 
 	page = vm_normal_page(vma, vmf->address, pte);
 	if (!page)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 22b4d1a0ea53..4f9301195de5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2394,6 +2394,7 @@ static void sp_free(struct sp_node *n)
  * @page: page to be checked
  * @vma: vm area where page mapped
  * @addr: virtual address where page mapped
+ * @flags: numa balancing flags
  *
  * Lookup current policy node id for vma,addr and "compare to" page's
  * node id.
@@ -2405,7 +2406,8 @@ static void sp_free(struct sp_node *n)
  * Policy determination "mimics" alloc_page_vma().
  * Called from fault path where we know the vma and faulting address.
  */
-int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+		   unsigned long addr, int flags)
 {
 	struct mempolicy *pol;
 	struct zoneref *z;
@@ -2459,7 +2461,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	if (pol->flags & MPOL_F_MORON) {
 		polnid = thisnid;
 
-		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
+		if (!should_numa_migrate_memory(current, page, curnid,
+						thiscpu, addr, flags))
 			goto out;
 	}
 
-- 
2.24.1




* [RFC -V2 7/8] autonuma, memory tiering: Double hot threshold for write hint page fault
  2020-02-18  8:26 [RFC -V2 0/8] autonuma: Optimize memory placement in memory tiering system Huang, Ying
                   ` (5 preceding siblings ...)
  2020-02-18  8:26 ` [RFC -V2 6/8] autonuma, memory tiering: Select hotter pages to promote to fast memory node Huang, Ying
@ 2020-02-18  8:26 ` Huang, Ying
  2020-02-18  8:26 ` [RFC -V2 8/8] autonuma, memory tiering: Adjust hot threshold automatically Huang, Ying
  7 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2020-02-18  8:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-mm, linux-kernel, Feng Tang, Huang Ying, Andrew Morton,
	Michal Hocko, Rik van Riel, Mel Gorman, Dave Hansen,
	Dan Williams

From: Huang Ying <ying.huang@intel.com>

The write performance of PMEM is much worse than its read
performance.  So even if a write-mostly page is colder than a
read-mostly page, it is usually better to put the write-mostly page
in DRAM and the read-mostly page in PMEM.

To give write-mostly pages more opportunity to be promoted to DRAM,
in this patch the hot threshold for write hint page faults is doubled
(making them easier to promote).

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/sched/numa_balancing.h | 1 +
 kernel/sched/fair.c                  | 2 ++
 mm/memory.c                          | 3 +++
 3 files changed, 6 insertions(+)

diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 4899ec000245..518b3d6143ba 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -15,6 +15,7 @@
 #define TNF_FAULT_LOCAL	0x08
 #define TNF_MIGRATE_FAIL 0x10
 #define TNF_YOUNG	0x20
+#define TNF_WRITE	0x40
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 773f3220efc4..e5f7f4139c82 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1513,6 +1513,8 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 			return false;
 
 		th = msecs_to_jiffies(sysctl_numa_balancing_hot_threshold);
+		if (flags & TNF_WRITE)
+			th *= 2;
 		latency = numa_hint_fault_latency(p, addr);
 		if (latency > th)
 			return false;
diff --git a/mm/memory.c b/mm/memory.c
index 207caa9e61da..595d3cd62f61 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3829,6 +3829,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	if (pte_young(old_pte))
 		flags |= TNF_YOUNG;
 
+	if (vmf->flags & FAULT_FLAG_WRITE)
+		flags |= TNF_WRITE;
+
 	page = vm_normal_page(vma, vmf->address, pte);
 	if (!page)
 		goto unmap_out;
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC -V2 8/8] autonuma, memory tiering: Adjust hot threshold automatically
  2020-02-18  8:26 [RFC -V2 0/8] autonuma: Optimize memory placement in memory tiering system Huang, Ying
                   ` (6 preceding siblings ...)
  2020-02-18  8:26 ` [RFC -V2 7/8] autonuma, memory tiering: Double hot threshold for write hint page fault Huang, Ying
@ 2020-02-18  8:26 ` Huang, Ying
  7 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2020-02-18  8:26 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-mm, linux-kernel, Feng Tang, Huang Ying, Andrew Morton,
	Michal Hocko, Rik van Riel, Mel Gorman, Dave Hansen,
	Dan Williams

From: Huang Ying <ying.huang@intel.com>

It isn't easy for the administrator to determine the hot threshold.
So in this patch, a method to adjust the hot threshold automatically
is implemented.  The basic idea is to control the number of candidate
promotion pages to match the promotion rate limit.  If the hint page
fault latency of a page is less than the hot threshold, we will try
to promote the page; that is, the page is a candidate promotion page.

If the number of candidate promotion pages in the statistics
interval is much higher than the promotion rate limit, the hot
threshold will be lowered to reduce the number of candidate promotion
pages.  If it is much lower than the promotion rate limit, the hot
threshold will be raised to increase the number of candidate
promotion pages.

To make the above method work, in each statistics interval the total
number of pages to check (on which the hint page faults occur) and
the hot/cold distribution need to be stable.  Because the page tables
are scanned linearly in autonuma while the hot/cold distribution
isn't uniform across the address space, the statistics interval
should be larger than the autonuma scan period.  So in this patch,
the max scan period is used as the statistics interval, and it works
well in our tests.

The sysctl knob kernel.numa_balancing_hot_threshold_ms becomes the
initial value and max value of the hot threshold.
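
For illustration, a simplified user-space sketch of this feedback
loop (adjust_hot_threshold() and its parameters are made up for the
example, and the kernel's cmpxchg-based serialization is omitted; the
actual implementation is numa_migration_adjust_threshold() in the
diff below):

#define HOT_THRESHOLD_ADJUST_STEPS	16

static unsigned long adjust_hot_threshold(unsigned long threshold,
					  unsigned long max_threshold,
					  unsigned long candidate_pages,
					  unsigned long budget_pages)
{
	unsigned long step = max_threshold / HOT_THRESHOLD_ADJUST_STEPS;

	/*
	 * More than ~10% over budget: lower the threshold, but keep
	 * at least one step.
	 */
	if (candidate_pages > budget_pages * 11 / 10)
		threshold = threshold > 2 * step ? threshold - step : step;
	/* More than ~10% under budget: raise the threshold up to the max. */
	else if (candidate_pages < budget_pages * 9 / 10)
		threshold = threshold + step < max_threshold ?
			    threshold + step : max_threshold;

	return threshold;
}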

The patch improves the score of the pmbench memory accessing benchmark
with an 80:20 read/write ratio and a normal access address distribution
by 5.5%, with 24.6% fewer NUMA page migrations, on a 2 socket Intel
server with Optane DC Persistent Memory, because it improves the
accuracy of the hot page selection.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mmzone.h |  3 +++
 kernel/sched/fair.c    | 40 ++++++++++++++++++++++++++++++++++++----
 2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e7a28becdc2..4ed82eb5c8b5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -792,6 +792,9 @@ typedef struct pglist_data {
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned long numa_ts;
 	unsigned long numa_try;
+	unsigned long numa_threshold_ts;
+	unsigned long numa_threshold_try;
+	unsigned long numa_threshold;
 #endif
 	/* Fields commonly accessed by the page reclaim scanner */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e5f7f4139c82..90098c35d336 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1487,6 +1487,35 @@ static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
 	return true;
 }
 
+#define NUMA_MIGRATION_ADJUST_STEPS	16
+
+static void numa_migration_adjust_threshold(struct pglist_data *pgdat,
+					    unsigned long rate_limit,
+					    unsigned long ref_th)
+{
+	unsigned long now = jiffies, last_th_ts, th_period;
+	unsigned long unit_th, th;
+	unsigned long try, ref_try, tdiff;
+
+	th_period = msecs_to_jiffies(sysctl_numa_balancing_scan_period_max);
+	last_th_ts = pgdat->numa_threshold_ts;
+	if (now > last_th_ts + th_period &&
+	    cmpxchg(&pgdat->numa_threshold_ts, last_th_ts, now) == last_th_ts) {
+		ref_try = rate_limit *
+			sysctl_numa_balancing_scan_period_max / 1000;
+		try = node_page_state(pgdat, NUMA_TRY_MIGRATE);
+		tdiff = try - pgdat->numa_threshold_try;
+		unit_th = ref_th / NUMA_MIGRATION_ADJUST_STEPS;
+		th = pgdat->numa_threshold ? : ref_th;
+		if (tdiff > ref_try * 11 / 10)
+			th = max(th - unit_th, unit_th);
+		else if (tdiff < ref_try * 9 / 10)
+			th = min(th + unit_th, ref_th);
+		pgdat->numa_threshold_try = try;
+		pgdat->numa_threshold = th;
+	}
+}
+
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 				int src_nid, int dst_cpu, unsigned long addr,
 				int flags)
@@ -1502,7 +1531,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
 	    next_promotion_node(src_nid) != -1) {
 		struct pglist_data *pgdat;
-		unsigned long rate_limit, latency, th;
+		unsigned long rate_limit, latency, th, def_th;
 
 		pgdat = NODE_DATA(dst_nid);
 		if (pgdat_free_space_enough(pgdat))
@@ -1512,15 +1541,18 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 		if (!(flags & TNF_YOUNG))
 			return false;
 
-		th = msecs_to_jiffies(sysctl_numa_balancing_hot_threshold);
+		def_th = msecs_to_jiffies(sysctl_numa_balancing_hot_threshold);
+		rate_limit =
+			sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT);
+		numa_migration_adjust_threshold(pgdat, rate_limit, def_th);
+
+		th = pgdat->numa_threshold ? : def_th;
 		if (flags & TNF_WRITE)
 			th *= 2;
 		latency = numa_hint_fault_latency(p, addr);
 		if (latency > th)
 			return false;
 
-		rate_limit =
-			sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT);
 		return numa_migration_check_rate_limit(pgdat, rate_limit,
 						       hpage_nr_pages(page));
 	}
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC -V2 2/8] autonuma, memory tiering: Rate limit NUMA migration throughput
  2020-02-18  8:26 ` [RFC -V2 2/8] autonuma, memory tiering: Rate limit NUMA migration throughput Huang, Ying
@ 2020-02-18  8:57   ` Mel Gorman
  2020-02-19  6:01     ` Huang, Ying
  0 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2020-02-18  8:57 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Peter Zijlstra, Ingo Molnar, linux-mm, linux-kernel, Feng Tang,
	Andrew Morton, Michal Hocko, Rik van Riel, Dave Hansen,
	Dan Williams

On Tue, Feb 18, 2020 at 04:26:28PM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> In autonuma memory tiering mode, the hot PMEM (persistent memory)
> pages could be migrated to DRAM via autonuma.  But this incurs some
> overhead too.  So that sometimes the workload performance may be hurt.
> To avoid too much disturbing to the workload, the migration throughput
> should be rate-limited.
> 
> At the other hand, in some situation, for example, some workloads
> exits, many DRAM pages become free, so that some pages of the other
> workloads can be migrated to DRAM.  To respond to the workloads
> changing quickly, it's better to migrate pages faster.
> 
> To address the above 2 requirements, a rate limit algorithm as follows
> is used,
> 
> - If there is enough free memory in DRAM node (that is, > high
>   watermark + 2 * rate limit pages), then NUMA migration throughput will
>   not be rate-limited to respond to the workload changing quickly.
> 
> - Otherwise, counting the number of pages to try to migrate to a DRAM
>   node via autonuma, if the count exceeds the limit specified by the
>   users, stop NUMA migration until the next second.
> 
> A new sysctl knob kernel.numa_balancing_rate_limit_mbps is added for
> the users to specify the limit.  If its value is 0, the default
> value (high watermark) will be used.
> 
> TODO: Add ABI document for new sysctl knob.
> 

I very strongly suggest that this only be done as a last resort and with
supporting data as to why it is necessary. NUMA balancing did have rate
limiting at one point and it was removed when balancing was smart enough
to mostly do the right thing without rate limiting. I posted a series
that reconciled NUMA balancing with the CPU load balancer recently which
further reduced spurious and unnecessary migrations. I would not like
to see rate limiting reintroduced unless there is no other way of fixing
saturation of memory bandwidth due to NUMA balancing. Even if it's
needed as a stopgap while the feature is finalised, it should be
introduced late in the series explaining why it's temporarily necessary.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM
  2020-02-18  8:26 ` [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM Huang, Ying
@ 2020-02-18  9:09   ` Mel Gorman
  2020-02-19  6:05     ` Huang, Ying
  0 siblings, 1 reply; 13+ messages in thread
From: Mel Gorman @ 2020-02-18  9:09 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Peter Zijlstra, Ingo Molnar, linux-mm, linux-kernel, Feng Tang,
	Andrew Morton, Michal Hocko, Rik van Riel, Dave Hansen,
	Dan Williams

On Tue, Feb 18, 2020 at 04:26:29PM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> In a memory tiering system, if the memory size of the workloads is
> smaller than that of the faster memory (e.g. DRAM) nodes, all pages of
> the workloads should be put in the faster memory nodes.  But this
> makes it unnecessary to use slower memory (e.g. PMEM) at all.
> 
> So in common cases, the memory size of the workload should be larger
> than that of the faster memory nodes.  And to optimize the
> performance, the hot pages should be promoted to the faster memory
> nodes while the cold pages should be demoted to the slower memory
> nodes.  To achieve that, we have two choices,
> 
> a. Promote the hot pages from the slower memory node to the faster
>    memory node.  This will create some memory pressure in the faster
>    memory node, thus trigger the memory reclaiming, where the cold
>    pages will be demoted to the slower memory node.
> 
> b. Demote the cold pages from faster memory node to the slower memory
>    node.  This will create some free memory space in the faster memory
>    node, and the hot pages in the slower memory node could be promoted
>    to the faster memory node.
> 
> The choice "a" will create the memory pressure in the faster memory
> node.  If the memory pressure of the workload is high too, the memory
> pressure may become so high that the memory allocation latency of the
> workload is influenced, e.g. the direct reclaiming may be triggered.
> 
> The choice "b" works much better at this aspect.  If the memory
> pressure of the workload is high, it will consume the free memory and
> the hot pages promotion will stop earlier if its allocation watermark
> is higher than that of the normal memory allocation.
> 
> In this patch, choice "b" is implemented.  If memory tiering NUMA
> balancing mode is enabled, the node isn't the slowest node, and the
> free memory size of the node is below the high watermark, the kswapd
> of the node will be waken up to free some memory until the free memory
> size is above the high watermark + autonuma promotion rate limit.  If
> the free memory size is below the high watermark, autonuma promotion
> will stop working.  This avoids to create too much memory pressure to
> the system.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>

Unfortunately I stopped reading at this point. It depends on another series
entirely and they really need to be presented together instead of relying
on searching mail archives to find other patches to try to assemble the full
picture :(. Ideally each stage would have supporting data showing roughly
how it behaves at each major stage. I know this will be a pain but the
original NUMA balancing had the same problem and ultimately started with
one large series that got the basics right followed by other series that
improved it in stages. That process is *still* ongoing today.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC -V2 2/8] autonuma, memory tiering: Rate limit NUMA migration throughput
  2020-02-18  8:57   ` Mel Gorman
@ 2020-02-19  6:01     ` Huang, Ying
  0 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2020-02-19  6:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, linux-mm, linux-kernel, Feng Tang,
	Andrew Morton, Michal Hocko, Rik van Riel, Dave Hansen,
	Dan Williams

Hi, Mel,

Thanks a lot for your review!

Mel Gorman <mgorman@suse.de> writes:

> On Tue, Feb 18, 2020 at 04:26:28PM +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> In autonuma memory tiering mode, the hot PMEM (persistent memory)
>> pages could be migrated to DRAM via autonuma.  But this incurs some
>> overhead too.  So that sometimes the workload performance may be hurt.
>> To avoid too much disturbing to the workload, the migration throughput
>> should be rate-limited.
>> 
>> At the other hand, in some situation, for example, some workloads
>> exits, many DRAM pages become free, so that some pages of the other
>> workloads can be migrated to DRAM.  To respond to the workloads
>> changing quickly, it's better to migrate pages faster.
>> 
>> To address the above 2 requirements, a rate limit algorithm as follows
>> is used,
>> 
>> - If there is enough free memory in DRAM node (that is, > high
>>   watermark + 2 * rate limit pages), then NUMA migration throughput will
>>   not be rate-limited to respond to the workload changing quickly.
>> 
>> - Otherwise, counting the number of pages to try to migrate to a DRAM
>>   node via autonuma, if the count exceeds the limit specified by the
>>   users, stop NUMA migration until the next second.
>> 
>> A new sysctl knob kernel.numa_balancing_rate_limit_mbps is added for
>> the users to specify the limit.  If its value is 0, the default
>> value (high watermark) will be used.
>> 
>> TODO: Add ABI document for new sysctl knob.
>> 
>
> I very strongly suggest that this only be done as a last resort and with
> supporting data as to why it is necessary. NUMA balancing did have rate
> limiting at one point and it was removed when balancing was smart enough
> to mostly do the right thing without rate limiting. I posted a series
> that reconciled NUMA balancing with the CPU load balancer recently which
> further reduced spurious and unnecessary migrations. I would not like
> to see rate limiting reintroduced unless there is no other way of fixing
> saturation of memory bandwidth due to NUMA balancing. Even if it's
> needed as a stopgap while the feature is finalised, it should be
> introduced late in the series explaining why it's temporarily necessary.

This adds a rate limit only to NUMA migration between different types
of memory nodes (e.g. from PMEM to DRAM), not between nodes of the
same memory type (e.g. from DRAM to DRAM).  Sorry for the confusing
patch subject.  I will change it in the next version.

And the rate limit is an inherent part of the algorithm used in the
patchset.  We just use the LRU algorithm to find the cold pages on
the fast memory node (e.g. DRAM), and use the NUMA hint page fault
latency to find the hot pages on the slow memory node (e.g. PMEM),
but we don't compare the temperature of the cold DRAM pages with that
of the hot PMEM pages.  Instead, we just try to exchange some cold
DRAM pages and some hot PMEM pages, even if the cold DRAM pages are
hotter than the hot PMEM pages.  The rate limit controls how many
pages are exchanged between DRAM and PMEM per second.  This isn't
perfect, but it works well in our testing.
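
For illustration only, a rough user-space sketch of the per-second
rate limit idea (promotion_within_limit() and its parameters are made
up here, not the in-kernel interface):

#include <stdbool.h>

struct promote_window {
	unsigned long start_sec;	/* start of the current 1s window */
	unsigned long tried_pages;	/* pages already tried in this window */
};

static bool promotion_within_limit(struct promote_window *w,
				   unsigned long now_sec,
				   unsigned long limit_pages_per_sec,
				   unsigned long nr_pages)
{
	/* Open a new one-second accounting window if the old one expired. */
	if (now_sec != w->start_sec) {
		w->start_sec = now_sec;
		w->tried_pages = 0;
	}

	w->tried_pages += nr_pages;

	/* Allow the promotion only while this second's budget lasts. */
	return w->tried_pages <= limit_pages_per_sec;
}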

Best Regards,
Huang, Ying



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM
  2020-02-18  9:09   ` Mel Gorman
@ 2020-02-19  6:05     ` Huang, Ying
  0 siblings, 0 replies; 13+ messages in thread
From: Huang, Ying @ 2020-02-19  6:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, linux-mm, linux-kernel, Feng Tang,
	Andrew Morton, Michal Hocko, Rik van Riel, Dave Hansen,
	Dan Williams

Mel Gorman <mgorman@suse.de> writes:

> On Tue, Feb 18, 2020 at 04:26:29PM +0800, Huang, Ying wrote:
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> In a memory tiering system, if the memory size of the workloads is
>> smaller than that of the faster memory (e.g. DRAM) nodes, all pages of
>> the workloads should be put in the faster memory nodes.  But this
>> makes it unnecessary to use slower memory (e.g. PMEM) at all.
>> 
>> So in common cases, the memory size of the workload should be larger
>> than that of the faster memory nodes.  And to optimize the
>> performance, the hot pages should be promoted to the faster memory
>> nodes while the cold pages should be demoted to the slower memory
>> nodes.  To achieve that, we have two choices,
>> 
>> a. Promote the hot pages from the slower memory node to the faster
>>    memory node.  This will create some memory pressure in the faster
>>    memory node, thus trigger the memory reclaiming, where the cold
>>    pages will be demoted to the slower memory node.
>> 
>> b. Demote the cold pages from faster memory node to the slower memory
>>    node.  This will create some free memory space in the faster memory
>>    node, and the hot pages in the slower memory node could be promoted
>>    to the faster memory node.
>> 
>> The choice "a" will create the memory pressure in the faster memory
>> node.  If the memory pressure of the workload is high too, the memory
>> pressure may become so high that the memory allocation latency of the
>> workload is influenced, e.g. the direct reclaiming may be triggered.
>> 
>> The choice "b" works much better at this aspect.  If the memory
>> pressure of the workload is high, it will consume the free memory and
>> the hot pages promotion will stop earlier if its allocation watermark
>> is higher than that of the normal memory allocation.
>> 
>> In this patch, choice "b" is implemented.  If memory tiering NUMA
>> balancing mode is enabled, the node isn't the slowest node, and the
>> free memory size of the node is below the high watermark, the kswapd
>> of the node will be waken up to free some memory until the free memory
>> size is above the high watermark + autonuma promotion rate limit.  If
>> the free memory size is below the high watermark, autonuma promotion
>> will stop working.  This avoids to create too much memory pressure to
>> the system.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>
> Unfortunately I stopped reading at this point. It depends on another series
> entirely and they really need to be presented together instead of relying
> on searching mail archives to find other patches to try to assemble the full
> picture :(. Ideally each stage would have supporting data showing roughly
> how it behaves at each major stage. I know this will be a pain but the
> original NUMA balancing had the same problem and ultimately started with
> one large series that got the basics right followed by other series that
> improved it in stages. That process is *still* ongoing today.

Sorry for the inconvenience.  We will post a new patchset including
both series and add supporting data at each major stage when possible.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-02-19  6:05 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-18  8:26 [RFC -V2 0/8] autonuma: Optimize memory placement in memory tiering system Huang, Ying
2020-02-18  8:26 ` [RFC -V2 1/8] autonuma: Add NUMA_BALANCING_MEMORY_TIERING mode Huang, Ying
2020-02-18  8:26 ` [RFC -V2 2/8] autonuma, memory tiering: Rate limit NUMA migration throughput Huang, Ying
2020-02-18  8:57   ` Mel Gorman
2020-02-19  6:01     ` Huang, Ying
2020-02-18  8:26 ` [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM Huang, Ying
2020-02-18  9:09   ` Mel Gorman
2020-02-19  6:05     ` Huang, Ying
2020-02-18  8:26 ` [RFC -V2 4/8] autonuma, memory tiering: Skip to scan fastest memory Huang, Ying
2020-02-18  8:26 ` [RFC -V2 5/8] autonuma, memory tiering: Only promote page if accessed twice Huang, Ying
2020-02-18  8:26 ` [RFC -V2 6/8] autonuma, memory tiering: Select hotter pages to promote to fast memory node Huang, Ying
2020-02-18  8:26 ` [RFC -V2 7/8] autonuma, memory tiering: Double hot threshold for write hint page fault Huang, Ying
2020-02-18  8:26 ` [RFC -V2 8/8] autonuma, memory tiering: Adjust hot threshold automatically Huang, Ying
