[RFC 00/10] autonuma: Optimize memory placement in memory tiering system

* [RFC 00/10] autonuma: Optimize memory placement in memory tiering system
@ 2019-11-01  7:57 Huang, Ying
  2019-11-01  7:57 ` [RFC 01/10] autonuma: Fix watermark checking in migrate_balanced_pgdat() Huang, Ying
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Huang, Ying @ 2019-11-01  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Huang Ying, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

From: Huang Ying <ying.huang@intel.com>

With the advent of various new memory types, one machine may have
multiple memory types, e.g. DRAM and PMEM (persistent memory).
Because the performance and cost of the different memory types may
differ, the memory subsystem of such a machine can be called a
memory tiering system.

After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
for use like normal RAM"), the PMEM could be used as cost-effective
volatile memory in separate NUMA nodes.  In a typical memory tiering
system, there are CPUs, DRAM and PMEM in each physical NUMA node.  The
CPUs and the DRAM will be put in one logical node, while the PMEM will
be put in another (faked) logical node.

To optimize overall system performance, the hot pages should be
placed in the DRAM node.  To do that, we need to identify the hot pages
in the PMEM node and migrate them to the DRAM node via NUMA migration.

Autonuma already has a set of mechanisms to identify the pages
recently accessed by the CPUs in a node and migrate those pages to
that node.  So we can reuse these mechanisms to optimize page
placement in the memory tiering system.  This has been implemented in
this patchset.

On the other hand, the cold pages should be placed in the PMEM node.
So, we also need to identify the cold pages in the DRAM node and
migrate them to the PMEM node.

The following patchset,

[PATCH 0/4] [RFC] Migrate Pages in lieu of discard
https://lore.kernel.org/linux-mm/20191016221148.F9CCD155@viggo.jf.intel.com/

implements a mechanism to demote the cold DRAM pages to the PMEM node
under memory pressure.  Based on that, the cold DRAM pages can be
demoted to the PMEM node proactively to free some memory space on the
DRAM node, so that the hot PMEM pages can be migrated to the DRAM node.
This has been implemented in this patchset too.

The patchset is based on the following not-yet-merged patchset:

[PATCH 0/4] [RFC] Migrate Pages in lieu of discard
https://lore.kernel.org/linux-mm/20191016221148.F9CCD155@viggo.jf.intel.com/

This is part of a larger patch set.  If you want to apply these or
play with them, I'd suggest using the tree from here.

    http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com

With all of the above optimizations, the score of the pmbench memory
access benchmark with an 80:20 read/write ratio and a normal access
address distribution improves by 116% on a 2-socket Intel server with
Optane DC Persistent Memory.

Best Regards,
Huang, Ying

* [RFC 01/10] autonuma: Fix watermark checking in migrate_balanced_pgdat()
  2019-11-01  7:57 [RFC 00/10] autonuma: Optimize memory placement in memory tiering system Huang, Ying
@ 2019-11-01  7:57 ` Huang, Ying
  2019-11-01 11:11   ` Mel Gorman
  2019-11-01  7:57 ` [RFC 02/10] autonuma: Reduce cache footprint when scanning page tables Huang, Ying
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2019-11-01  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Huang Ying, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

From: Huang Ying <ying.huang@intel.com>

When zone_watermark_ok() is called in migrate_balanced_pgdat() to
check the migration target node, the parameter classzone_idx (for the
requested zone) is specified as 0 (ZONE_DMA).  But when allocating
memory for autonuma in alloc_misplaced_dst_page(), the requested zone
from the GFP flags is ZONE_MOVABLE.  That is, the two requested zones
differ, and so does the size of the lowmem_reserve applied to each.
This may cause some issues.

For example, in the zoneinfo of a test machine as below,

Node 0, zone    DMA32
  pages free     61592
        min      29
        low      454
        high     879
        spanned  1044480
        present  442306
        managed  425921
        protection: (0, 0, 62457, 62457, 62457)

The number of free pages in ZONE_DMA32 is greater than "high watermark +
lowmem_reserve[ZONE_DMA]", but less than "high watermark +
lowmem_reserve[ZONE_MOVABLE]".  So although __alloc_pages_node() in
alloc_misplaced_dst_page() requests ZONE_MOVABLE, the
zone_watermark_ok() on ZONE_DMA32 in migrate_balanced_pgdat(), which
is called with ZONE_DMA, may always return true.  So, autonuma may not
stop even when memory pressure in node 0 is heavy.

To fix the issue, ZONE_MOVABLE is passed as the classzone_idx parameter
of zone_watermark_ok() in migrate_balanced_pgdat().  This makes it the
same as the requested zone in alloc_misplaced_dst_page(), so that
migrate_balanced_pgdat() returns false when memory pressure is heavy.
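
As a rough illustration of the arithmetic above, here is a user-space
sketch (not kernel code).  struct zone_model, watermark_ok_model() and
the zone index order are assumptions made for the example; the real
zone_watermark_ok() also accounts for the allocation order and other
details, and the extra nr_migrate_pages term is left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed zone order, matching the five-entry protection array above. */
enum zone_idx {
	ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, ZONE_DEVICE, NR_ZONES
};

/* Hypothetical, simplified stand-in for the fields of struct zone used here. */
struct zone_model {
	unsigned long free_pages;
	unsigned long high_wmark;
	unsigned long lowmem_reserve[NR_ZONES];
};

/* Simplified order-0 check: free pages must exceed mark + reserve. */
bool watermark_ok_model(const struct zone_model *z, unsigned long mark,
			int classzone_idx)
{
	return z->free_pages > mark + z->lowmem_reserve[classzone_idx];
}

/* Numbers copied from the "Node 0, zone DMA32" zoneinfo above. */
const struct zone_model dma32_example = {
	.free_pages = 61592,
	.high_wmark = 879,
	.lowmem_reserve = { 0, 0, 62457, 62457, 62457 },
};

void check_dma32_example(void)
{
	const struct zone_model *z = &dma32_example;

	/* classzone_idx 0: 61592 > 879 + 0, so the check passes. */
	assert(watermark_ok_model(z, z->high_wmark, ZONE_DMA));
	/* ZONE_MOVABLE: 61592 < 879 + 62457 = 63336, so the check fails. */
	assert(!watermark_ok_model(z, z->high_wmark, ZONE_MOVABLE));
}
```

With the old parameter the model reports the zone as a valid target;
with ZONE_MOVABLE it does not, which is the behaviour the fix is after.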

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 513107baccd3..8f06bd37d927 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1954,7 +1954,7 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 		if (!zone_watermark_ok(zone, 0,
 				       high_wmark_pages(zone) +
 				       nr_migrate_pages,
-				       0, 0))
+				       ZONE_MOVABLE, 0))
 			continue;
 		return true;
 	}
-- 
2.23.0

* Re: [RFC 01/10] autonuma: Fix watermark checking in migrate_balanced_pgdat()
  2019-11-01  7:57 ` [RFC 01/10] autonuma: Fix watermark checking in migrate_balanced_pgdat() Huang, Ying
@ 2019-11-01 11:11   ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2019-11-01 11:11 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Andrew Morton,
	Michal Hocko, Rik van Riel, Ingo Molnar, Dave Hansen,
	Dan Williams, Fengguang Wu

On Fri, Nov 01, 2019 at 03:57:18PM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> When zone_watermark_ok() is called in migrate_balanced_pgdat() to
> check the migration target node, the parameter classzone_idx (for the
> requested zone) is specified as 0 (ZONE_DMA).  But when allocating
> memory for autonuma in alloc_misplaced_dst_page(), the requested zone
> from the GFP flags is ZONE_MOVABLE.  That is, the two requested zones
> differ, and so does the size of the lowmem_reserve applied to each.
> This may cause some issues.
> 
> For example, in the zoneinfo of a test machine as below,
> 
> Node 0, zone    DMA32
>   pages free     61592
>         min      29
>         low      454
>         high     879
>         spanned  1044480
>         present  442306
>         managed  425921
>         protection: (0, 0, 62457, 62457, 62457)
> 
> The number of free pages in ZONE_DMA32 is greater than "high watermark +
> lowmem_reserve[ZONE_DMA]", but less than "high watermark +
> lowmem_reserve[ZONE_MOVABLE]".  So although __alloc_pages_node() in
> alloc_misplaced_dst_page() requests ZONE_MOVABLE, the
> zone_watermark_ok() on ZONE_DMA32 in migrate_balanced_pgdat(), which
> is called with ZONE_DMA, may always return true.  So, autonuma may not
> stop even when memory pressure in node 0 is heavy.
> 
> To fix the issue, ZONE_MOVABLE is passed as the classzone_idx parameter
> of zone_watermark_ok() in migrate_balanced_pgdat().  This makes it the
> same as the requested zone in alloc_misplaced_dst_page(), so that
> migrate_balanced_pgdat() returns false when memory pressure is heavy.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>

Acked-by: Mel Gorman <mgorman@suse.de>

This patch is independent of the series and should be resent separately.
Alternatively Andrew, please pick this patch up on its own.

-- 
Mel Gorman
SUSE Labs

* [RFC 02/10] autonuma: Reduce cache footprint when scanning page tables
  2019-11-01  7:57 [RFC 00/10] autonuma: Optimize memory placement in memory tiering system Huang, Ying
  2019-11-01  7:57 ` [RFC 01/10] autonuma: Fix watermark checking in migrate_balanced_pgdat() Huang, Ying
@ 2019-11-01  7:57 ` Huang, Ying
  2019-11-01 11:13   ` Mel Gorman
  2019-11-01  7:57 ` [RFC 03/10] autonuma: Add NUMA_BALANCING_MEMORY_TIERING mode Huang, Ying
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2019-11-01  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Huang Ying, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

From: Huang Ying <ying.huang@intel.com>

In auto NUMA balancing page table scanning, if pte_protnone() is
true, the PTE does not need to be changed because it is already in the
target state.  So the other checks on the corresponding struct page
are unnecessary too.

So, if we check pte_protnone() first for each PTE, we can avoid
unnecessary struct page accesses and thereby reduce the cache footprint
of NUMA balancing page table scanning.

In the performance test of the pmbench memory access benchmark with an
80:20 read/write ratio and a normal access address distribution on a
2-socket Intel server with Optane DC Persistent Memory, perf profiling
shows that the autonuma page table scanning time drops from 1.23% to
0.97% (that is, a 21% reduction) with the patch.
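
The effect of the reordering can be sketched with a small user-space
model (not kernel code).  struct pte_model, struct page_model and
scan_ptes_model() are hypothetical names for this example, and the
touches counter merely stands in for the cache line loads caused by
looking at struct page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; touches counts accesses. */
struct page_model {
	bool is_ksm;
	int touches;
};

/* Hypothetical stand-in for a PTE and the page it maps. */
struct pte_model {
	bool protnone;
	struct page_model *page;
};

/* Returns how many PTEs were switched to the protnone state. */
size_t scan_ptes_model(struct pte_model *ptes, size_t n)
{
	size_t changed = 0;

	for (size_t i = 0; i < n; i++) {
		struct pte_model *pte = &ptes[i];

		/* Cheap test first: it only looks at the PTE itself. */
		if (pte->protnone)
			continue;

		struct page_model *page = pte->page;
		if (!page)
			continue;

		/* Everything below needs the struct page contents. */
		page->touches++;
		if (page->is_ksm)
			continue;

		pte->protnone = true;
		changed++;
	}
	return changed;
}

void check_scan_model(void)
{
	struct page_model pages[3] = { { false, 0 }, { false, 0 }, { true, 0 } };
	struct pte_model ptes[3] = {
		{ true, &pages[0] },	/* already protnone: page untouched */
		{ false, &pages[1] },	/* normal page: gets changed */
		{ false, &pages[2] },	/* KSM page: inspected, then skipped */
	};

	assert(scan_ptes_model(ptes, 3) == 1);
	assert(pages[0].touches == 0);
}
```

In this model, a PTE that is already protnone never causes its page to
be inspected, which is the saving the patch aims for.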

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/mprotect.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index bf38dfbbb4b4..d69b9913388e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -80,6 +80,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			if (prot_numa) {
 				struct page *page;
 
+				/* Avoid TLB flush if possible */
+				if (pte_protnone(oldpte))
+					continue;
+
 				page = vm_normal_page(vma, addr, oldpte);
 				if (!page || PageKsm(page))
 					continue;
@@ -97,10 +101,6 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				if (page_is_file_cache(page) && PageDirty(page))
 					continue;
 
-				/* Avoid TLB flush if possible */
-				if (pte_protnone(oldpte))
-					continue;
-
 				/*
 				 * Don't mess with PTEs if page is already on the node
 				 * a single-threaded process is running on.
-- 
2.23.0


* Re: [RFC 02/10] autonuma: Reduce cache footprint when scanning page tables
  2019-11-01  7:57 ` [RFC 02/10] autonuma: Reduce cache footprint when scanning page tables Huang, Ying
@ 2019-11-01 11:13   ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2019-11-01 11:13 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Andrew Morton,
	Michal Hocko, Rik van Riel, Ingo Molnar, Dave Hansen,
	Dan Williams, Fengguang Wu

On Fri, Nov 01, 2019 at 03:57:19PM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
> 
> In auto NUMA balancing page table scanning, if pte_protnone() is
> true, the PTE does not need to be changed because it is already in the
> target state.  So the other checks on the corresponding struct page
> are unnecessary too.
> 
> So, if we check pte_protnone() first for each PTE, we can avoid
> unnecessary struct page accesses and thereby reduce the cache footprint
> of NUMA balancing page table scanning.
> 
> In the performance test of the pmbench memory access benchmark with an
> 80:20 read/write ratio and a normal access address distribution on a
> 2-socket Intel server with Optane DC Persistent Memory, perf profiling
> shows that the autonuma page table scanning time drops from 1.23% to
> 0.97% (that is, a 21% reduction) with the patch.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>

Acked-by: Mel Gorman <mgorman@suse.de>

This patch is independent of the series and should be resent separately.
Alternatively Andrew, please pick this patch up on its own.

-- 
Mel Gorman
SUSE Labs

* [RFC 03/10] autonuma: Add NUMA_BALANCING_MEMORY_TIERING mode
  2019-11-01  7:57 [RFC 00/10] autonuma: Optimize memory placement in memory tiering system Huang, Ying
  2019-11-01  7:57 ` [RFC 01/10] autonuma: Fix watermark checking in migrate_balanced_pgdat() Huang, Ying
  2019-11-01  7:57 ` [RFC 02/10] autonuma: Reduce cache footprint when scanning page tables Huang, Ying
@ 2019-11-01  7:57 ` Huang, Ying
  2019-11-01  7:57 ` [RFC 04/10] autonuma, memory tiering: Rate limit NUMA migration throughput Huang, Ying
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Huang, Ying @ 2019-11-01  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Huang Ying, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

From: Huang Ying <ying.huang@intel.com>

With the advent of various new memory types, some machines will have
multiple memory types, e.g. DRAM and PMEM (persistent memory).
Because the performance of the different memory types may differ, the
memory subsystem can be called a memory tiering system.

In a typical memory tiering system, there are CPUs, fast memory and
slow memory in each physical NUMA node.  The CPUs and the fast memory
will be put in one logical node (called the fast memory node), while
the slow memory will be put in another (faked) logical node (called
the slow memory node).  Autonuma already has a set of mechanisms to
identify the pages recently accessed by the CPUs in a node and migrate
those pages to that node.  So the performance optimization of promoting
the hot pages in the slow memory node to the fast memory node in the
memory tiering system can be implemented on top of the autonuma
framework.

But the requirements of hot page promotion in the memory tiering
system differ from those of normal NUMA balancing in some aspects.
E.g. for hot page promotion, we can skip scanning the fastest memory
node because we have nowhere to promote its hot pages to.  To make
autonuma work for both normal NUMA balancing and memory tiering hot
page promotion, we have defined a set of flags and made the value of
sysctl_numa_balancing_mode the bitwise OR of these flags.  The flags
are as follows,

- 0x0: NUMA_BALANCING_DISABLED
- 0x1: NUMA_BALANCING_NORMAL
- 0x2: NUMA_BALANCING_MEMORY_TIERING

NUMA_BALANCING_NORMAL enables normal NUMA balancing across sockets,
while NUMA_BALANCING_MEMORY_TIERING enables hot page promotion across
memory tiers.  They can be enabled individually or together.  If all
flags are cleared, autonuma is disabled completely.

The sysctl interface is extended accordingly in a backward compatible
way.
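
A minimal user-space sketch of how such a mode value decodes.  The
three flag values are the ones listed above; the helper names
mode_is_valid(), normal_balancing_enabled() and
tiering_promotion_enabled() are hypothetical and only serve the
example.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_BALANCING_DISABLED		0x0
#define NUMA_BALANCING_NORMAL		0x1
#define NUMA_BALANCING_MEMORY_TIERING	0x2

/* The sysctl accepts 0..3, i.e. any OR of the two flags. */
bool mode_is_valid(int mode)
{
	return mode >= 0 &&
	       mode <= (NUMA_BALANCING_NORMAL | NUMA_BALANCING_MEMORY_TIERING);
}

bool normal_balancing_enabled(int mode)
{
	return (mode & NUMA_BALANCING_NORMAL) != 0;
}

bool tiering_promotion_enabled(int mode)
{
	return (mode & NUMA_BALANCING_MEMORY_TIERING) != 0;
}

void check_modes(void)
{
	/* 0 and 1 keep their old meaning: disabled and normal balancing. */
	assert(!normal_balancing_enabled(NUMA_BALANCING_DISABLED));
	assert(!tiering_promotion_enabled(NUMA_BALANCING_DISABLED));
	assert(normal_balancing_enabled(1) && !tiering_promotion_enabled(1));
	/* 2 enables only promotion, 3 enables both. */
	assert(!normal_balancing_enabled(2) && tiering_promotion_enabled(2));
	assert(normal_balancing_enabled(3) && tiering_promotion_enabled(3));
	assert(!mode_is_valid(4));
}
```

Because the values 0 and 1 decode exactly as before, existing users of
the sysctl see no change in behaviour.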

TODO:

- Update ABI document: Documentation/sysctl/kernel.txt

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/sched/sysctl.h | 5 +++++
 kernel/sched/core.c          | 9 +++------
 kernel/sysctl.c              | 7 ++++---
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 99ce6d728df7..5cfe38783c60 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -33,6 +33,11 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+#define NUMA_BALANCING_DISABLED		0x0
+#define NUMA_BALANCING_NORMAL		0x1
+#define NUMA_BALANCING_MEMORY_TIERING	0x2
+
+extern int sysctl_numa_balancing_mode;
 extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427742a9..6f490e2fd45e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2169,6 +2169,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
+int sysctl_numa_balancing_mode;
 
 #ifdef CONFIG_NUMA_BALANCING
 
@@ -2184,20 +2185,16 @@ void set_numabalancing_state(bool enabled)
 int sysctl_numa_balancing(struct ctl_table *table, int write,
 			 void __user *buffer, size_t *lenp, loff_t *ppos)
 {
-	struct ctl_table t;
 	int err;
-	int state = static_branch_likely(&sched_numa_balancing);
 
 	if (write && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	t = *table;
-	t.data = &state;
-	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 	if (err < 0)
 		return err;
 	if (write)
-		set_numabalancing_state(state);
+		set_numabalancing_state(*(int *)table->data);
 	return err;
 }
 #endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1beca96fb625..442acedb1fe7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -129,6 +129,7 @@ static int __maybe_unused neg_one = -1;
 static int zero;
 static int __maybe_unused one = 1;
 static int __maybe_unused two = 2;
+static int __maybe_unused three = 3;
 static int __maybe_unused four = 4;
 static unsigned long zero_ul;
 static unsigned long one_ul = 1;
@@ -422,12 +423,12 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.procname	= "numa_balancing",
-		.data		= NULL, /* filled in by handler */
-		.maxlen		= sizeof(unsigned int),
+		.data		= &sysctl_numa_balancing_mode,
+		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= sysctl_numa_balancing,
 		.extra1		= &zero,
-		.extra2		= &one,
+		.extra2		= &three,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
-- 
2.23.0


* [RFC 04/10] autonuma, memory tiering: Rate limit NUMA migration throughput
  2019-11-01  7:57 [RFC 00/10] autonuma: Optimize memory placement in memory tiering system Huang, Ying
                   ` (2 preceding siblings ...)
  2019-11-01  7:57 ` [RFC 03/10] autonuma: Add NUMA_BALANCING_MEMORY_TIERING mode Huang, Ying
@ 2019-11-01  7:57 ` Huang, Ying
  2019-11-01  7:57 ` [RFC 05/10] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM Huang, Ying
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Huang, Ying @ 2019-11-01  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Huang Ying, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

From: Huang Ying <ying.huang@intel.com>

In autonuma memory tiering mode, the hot PMEM (persistent memory)
pages can be migrated to DRAM via autonuma.  But this incurs some
overhead too, so sometimes the workload performance may be hurt.
To avoid disturbing the workload too much, the migration throughput
should be rate-limited.

On the other hand, in some situations, for example when some workloads
exit, many DRAM pages become free, so that some pages of the other
workloads can be migrated to DRAM.  To respond quickly to such workload
changes, it's better to migrate pages faster.

To address the above 2 requirements, the following rate limit algorithm
is used,

- If there is enough free memory in the DRAM node (that is, > high
  watermark + 2 * rate limit pages), then NUMA migration throughput is
  not rate-limited, so as to respond quickly to workload changes.

- Otherwise, count the number of pages that autonuma tries to migrate
  to the DRAM node; if the count exceeds the limit specified by the
  users, stop NUMA migration until the next second.

A new sysctl knob kernel.numa_balancing_rate_limit_mbps is added for
the users to specify the limit.  If its value is 0, the default
value (high watermark) will be used.
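
The two rules above can be sketched in user space as follows.  This is
only a model under stated assumptions: 4 KiB pages (MODEL_PAGE_SHIFT),
time passed in as milliseconds instead of jiffies, no concurrency, and
hypothetical names (struct node_model, mb_to_pages(),
promotion_allowed()).  It is not the code of the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical per-node state for the one-second accounting window. */
struct node_model {
	unsigned long window_start_ms;
	unsigned long total_tried;	/* pages tried so far, never reset */
	unsigned long window_base;	/* total_tried when the window began */
};

/* Convert a limit in MB to pages, like "<< (20 - PAGE_SHIFT)". */
unsigned long mb_to_pages(unsigned int mb)
{
	return (unsigned long)mb << (20 - MODEL_PAGE_SHIFT);
}

bool promotion_allowed(struct node_model *node, unsigned long now_ms,
		       unsigned long free_pages, unsigned long high_wmark,
		       unsigned long limit_pages, unsigned long nr_pages)
{
	/* Rule 1: plenty of free memory, so do not rate limit at all. */
	if (free_pages > high_wmark + 2 * limit_pages)
		return true;

	/* Rule 2: count the attempt, restart the window once a second. */
	node->total_tried += nr_pages;
	if (now_ms - node->window_start_ms > 1000) {
		node->window_start_ms = now_ms;
		node->window_base = node->total_tried;
	}
	return node->total_tried - node->window_base <= limit_pages;
}

void check_rate_limit(void)
{
	struct node_model node = { 0, 0, 0 };
	unsigned long limit = mb_to_pages(1);	/* 256 pages */

	assert(promotion_allowed(&node, 0, 100, 1000, limit, 200));
	assert(!promotion_allowed(&node, 500, 100, 1000, limit, 100));
	assert(promotion_allowed(&node, 1501, 100, 1000, limit, 100));
}
```

In check_rate_limit(), the second attempt is refused because 300 pages
exceed the 256-page budget of the current window, and the third one is
allowed again because more than a second has passed.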

TODO: Add ABI document for new sysctl knob.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mmzone.h       |  7 ++++
 include/linux/sched/sysctl.h |  1 +
 kernel/sched/fair.c          | 63 ++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c              |  8 +++++
 mm/vmstat.c                  |  3 ++
 5 files changed, 82 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d0b6ececaf39..46382b058546 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -247,6 +247,9 @@ enum node_stat_item {
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
+#ifdef CONFIG_NUMA_BALANCING
+	NUMA_TRY_MIGRATE,	/* pages to try to migrate via NUMA balancing */
+#endif
 	NR_VM_NODE_STAT_ITEMS
 };
 
@@ -766,6 +769,10 @@ typedef struct pglist_data {
 	unsigned long split_queue_len;
 #endif
 
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned long autonuma_jiffies;
+	unsigned long autonuma_try_migrate;
+#endif
 	/* Fields commonly accessed by the page reclaim scanner */
 	struct lruvec		lruvec;
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 5cfe38783c60..e3616889a91c 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -42,6 +42,7 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
+extern unsigned int sysctl_numa_balancing_rate_limit;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f5e528..489e2e21bb5d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1047,6 +1047,12 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
 /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
 unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
+/*
+ * Limit on NUMA migration throughput, in MB per second, for each target
+ * node when there is not enough free space in the target node
+ */
+unsigned int sysctl_numa_balancing_rate_limit;
+
 struct numa_group {
 	refcount_t refcount;
 
@@ -1397,6 +1403,44 @@ static inline unsigned long group_weight(struct task_struct *p, int nid,
 	return 1000 * faults / total_faults;
 }
 
+static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+{
+	int z;
+	unsigned long rate_limit;
+
+	rate_limit = sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT);
+	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+		struct zone *zone = pgdat->node_zones + z;
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone_watermark_ok(zone, 0,
+				      high_wmark_pages(zone) + rate_limit * 2,
+				      ZONE_MOVABLE, 0))
+			return true;
+	}
+	return false;
+}
+
+static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
+					    unsigned long rate_limit, int nr)
+{
+	unsigned long try_migrate;
+	unsigned long now = jiffies, last_jiffies;
+
+	mod_node_page_state(pgdat, NUMA_TRY_MIGRATE, nr);
+	try_migrate = node_page_state(pgdat, NUMA_TRY_MIGRATE);
+	last_jiffies = pgdat->autonuma_jiffies;
+	if (now > last_jiffies + HZ &&
+	    cmpxchg(&pgdat->autonuma_jiffies, last_jiffies, now) ==
+	    last_jiffies)
+		pgdat->autonuma_try_migrate = try_migrate;
+	if (try_migrate - pgdat->autonuma_try_migrate > rate_limit)
+		return false;
+	return true;
+}
+
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 				int src_nid, int dst_cpu)
 {
@@ -1404,6 +1448,25 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	int dst_nid = cpu_to_node(dst_cpu);
 	int last_cpupid, this_cpupid;
 
+	/*
+	 * If memory tiering mode is enabled, try to promote pages
+	 * in the slow memory node to the fast memory node.
+	 */
+	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+	    next_promotion_node(src_nid) != -1) {
+		struct pglist_data *pgdat;
+		unsigned long rate_limit;
+
+		pgdat = NODE_DATA(dst_nid);
+		if (pgdat_free_space_enough(pgdat))
+			return true;
+
+		rate_limit = sysctl_numa_balancing_rate_limit <<
+			(20 - PAGE_SHIFT);
+		return numa_migration_check_rate_limit(pgdat, rate_limit,
+						       hpage_nr_pages(page));
+	}
+
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
 	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 442acedb1fe7..c455ff404436 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -421,6 +421,14 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &one,
 	},
+	{
+		.procname	= "numa_balancing_rate_limit_mbps",
+		.data		= &sysctl_numa_balancing_rate_limit,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= &sysctl_numa_balancing_mode,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d4bbf9d69c06..c7ec14673ba8 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1198,6 +1198,9 @@ const char * const vmstat_text[] = {
 	"nr_dirtied",
 	"nr_written",
 	"nr_kernel_misc_reclaimable",
+#ifdef CONFIG_NUMA_BALANCING
+	"numa_try_migrate",
+#endif
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.23.0


* [RFC 05/10] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM
  2019-11-01  7:57 [RFC 00/10] autonuma: Optimize memory placement in memory tiering system Huang, Ying
                   ` (3 preceding siblings ...)
  2019-11-01  7:57 ` [RFC 04/10] autonuma, memory tiering: Rate limit NUMA migration throughput Huang, Ying
@ 2019-11-01  7:57 ` Huang, Ying
  2019-11-01  7:57 ` [RFC 06/10] autonuma, memory tiering: Skip to scan fastest memory Huang, Ying
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Huang, Ying @ 2019-11-01  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Huang Ying, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

From: Huang Ying <ying.huang@intel.com>

In a memory tiering system, if the memory size of the workloads is
smaller than that of the faster memory (e.g. DRAM) nodes, all pages of
the workloads should be put in the faster memory nodes.  But this
makes it unnecessary to use slower memory (e.g. PMEM) at all.

So in common cases, the memory size of the workload should be larger
than that of the faster memory nodes.  And to optimize the
performance, the hot pages should be promoted to the faster memory
nodes while the cold pages should be demoted to the slower memory
nodes.  To achieve that, we have two choices,

a. Promote the hot pages from the slower memory node to the faster
   memory node.  This will create some memory pressure in the faster
   memory node and thus trigger memory reclaim, where the cold
   pages will be demoted to the slower memory node.

b. Demote the cold pages from faster memory node to the slower memory
   node.  This will create some free memory space in the faster memory
   node, and the hot pages in the slower memory node could be promoted
   to the faster memory node.

Choice "a" will create memory pressure in the faster memory node.  If
the memory pressure of the workload is high too, the memory pressure
may become so high that the memory allocation latency of the workload
is affected, e.g. direct reclaim may be triggered.

Choice "b" works much better in this respect.  If the memory pressure
of the workload is high, it will consume the free memory, and hot page
promotion will stop earlier if its allocation watermark is higher than
that of normal memory allocation.

In this patch, choice "b" is implemented.  If memory tiering NUMA
balancing mode is enabled, the node isn't the slowest node, and the
free memory size of the node is below the high watermark, the kswapd
of the node will be woken up to free some memory until the free memory
size is above the high watermark + autonuma promotion rate limit.  If
the free memory size is below the high watermark, autonuma promotion
will stop working.  This avoids creating too much memory pressure on
the system.
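
The two thresholds described above can be sketched as a small
user-space model.  It ignores lowmem_reserve and the allocation order,
and the names promotion_target_ok() and kswapd_balanced_model() are
hypothetical; it is only meant to show the gap between where promotion
stops and where kswapd stops.

```c
#include <assert.h>
#include <stdbool.h>

/* Promotion into a node is allowed only above the high watermark. */
bool promotion_target_ok(unsigned long free_pages, unsigned long high_wmark)
{
	return free_pages > high_wmark;
}

/*
 * kswapd considers the node balanced above the high watermark; in
 * memory tiering mode, and if the node has a slower node to demote to,
 * the target is raised by the promotion rate limit (in pages).
 */
bool kswapd_balanced_model(unsigned long free_pages, unsigned long high_wmark,
			   unsigned long limit_pages, bool tiering_enabled,
			   bool has_slower_node)
{
	unsigned long mark = high_wmark;

	if (tiering_enabled && has_slower_node)
		mark += limit_pages;
	return free_pages > mark;
}

void check_thresholds(void)
{
	/* Example numbers: high watermark 1000 pages, limit 256 pages. */
	assert(!promotion_target_ok(1000, 1000));
	assert(promotion_target_ok(1100, 1000));
	/* Promotion already works at 1100, but kswapd keeps demoting. */
	assert(!kswapd_balanced_model(1100, 1000, 256, true, true));
	assert(kswapd_balanced_model(1257, 1000, 256, true, true));
	/* Without tiering mode, the plain high watermark applies. */
	assert(kswapd_balanced_model(1100, 1000, 256, false, true));
}
```

The gap between the two marks is what leaves room for roughly one
window's worth of promoted pages after kswapd has gone back to sleep.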

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/migrate.c | 26 +++++++++++++++++---------
 mm/vmscan.c  |  7 +++++++
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8f06bd37d927..dd9e31416c10 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -47,6 +47,7 @@
 #include <linux/page_owner.h>
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
+#include <linux/sched/sysctl.h>

 #include <asm/tlbflush.h>

@@ -1939,8 +1940,7 @@ COMPAT_SYSCALL_DEFINE6(move_pages, pid_t, pid, compat_ulong_t, nr_pages,
  * Returns true if this is a safe migration target node for misplaced NUMA
  * pages. Currently it only checks the watermarks which crude
  */
-static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
-				   unsigned long nr_migrate_pages)
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat, int order)
 {
 	int z;

@@ -1951,12 +1951,9 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 			continue;

 		/* Avoid waking kswapd by allocating pages_to_migrate pages. */
-		if (!zone_watermark_ok(zone, 0,
-				       high_wmark_pages(zone) +
-				       nr_migrate_pages,
-				       ZONE_MOVABLE, 0))
-			continue;
-		return true;
+		if (zone_watermark_ok(zone, order, high_wmark_pages(zone),
+				      ZONE_MOVABLE, 0))
+			return true;
 	}
 	return false;
 }
@@ -1983,8 +1980,19 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);

 	/* Avoid migrating to a node that is nearly full */
-	if (!migrate_balanced_pgdat(pgdat, 1UL << compound_order(page)))
+	if (!migrate_balanced_pgdat(pgdat, compound_order(page))) {
+		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) {
+			int z;
+
+			for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+				if (populated_zone(pgdat->node_zones + z))
+					break;
+			}
+			wakeup_kswapd(pgdat->node_zones + z,
+				      0, compound_order(page), ZONE_MOVABLE);
+		}
 		return 0;
+	}

 	if (isolate_lru_page(page))
 		return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6867b49ce360..ecc7f66ee2c3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@

 #include <linux/swapops.h>
 #include <linux/balloon_compaction.h>
+#include <linux/sched/sysctl.h>

 #include "internal.h"

@@ -3336,8 +3337,11 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
 	unsigned long mark = -1;
+	unsigned long promote_ratelimit;
 	struct zone *zone;

+	promote_ratelimit = sysctl_numa_balancing_rate_limit <<
+		(20 - PAGE_SHIFT);
 	/*
 	 * Check watermarks bottom-up as lower zones are more likely to
 	 * meet watermarks.
@@ -3349,6 +3353,9 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 			continue;

 		mark = high_wmark_pages(zone);
+		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+		    next_migration_node(pgdat->node_id) != -1)
+			mark += promote_ratelimit;
 		if (zone_watermark_ok_safe(zone, order, mark, classzone_idx))
 			return true;
 	}
-- 
2.23.0


* [RFC 06/10] autonuma, memory tiering: Skip to scan fastest memory
  2019-11-01  7:57 [RFC 00/10] autonuma: Optimize memory placement in memory tiering system Huang, Ying
                   ` (4 preceding siblings ...)
  2019-11-01  7:57 ` [RFC 05/10] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM Huang, Ying
@ 2019-11-01  7:57 ` Huang, Ying
  2019-11-01  7:57 ` [RFC 07/10] autonuma, memory tiering: Only promote page if accessed twice Huang, Ying
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Huang, Ying @ 2019-11-01  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Huang Ying, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

From: Huang Ying <ying.huang@intel.com>

In memory tiering NUMA balancing mode, the hot pages of the workload
in the fastest memory node cannot be promoted anywhere, so it's
unnecessary to identify the hot pages in the fastest memory node by
changing their PTE mappings to PROT_NONE.  The hint page faults on
these pages are avoided too.
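
As a rough userspace model of the skip condition (not the kernel
code): the mode bit values, the two-node topology table, and all
sketch_* names below are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit values for the numa_balancing sysctl mode; illustrative only. */
#define SKETCH_BALANCING_NORMAL		0x1
#define SKETCH_BALANCING_MEMORY_TIERING	0x2

/*
 * Hypothetical two-node topology: node 0 is DRAM (fastest, nothing to
 * promote to), node 1 is PMEM whose pages are promoted to node 0.
 */
static const int sketch_promotion_target[] = { -1, 0 };

static int sketch_next_promotion_node(int nid)
{
	assert(nid >= 0 && nid < 2);
	return sketch_promotion_target[nid];
}

/* True if the scanner should leave pages on node nid alone. */
static bool sketch_skip_scan(unsigned int mode, int nid)
{
	return !(mode & SKETCH_BALANCING_NORMAL) &&
	       sketch_next_promotion_node(nid) == -1;
}
```

With only the memory tiering bit set, sketch_skip_scan() is true for
the DRAM node and false for the PMEM node.  Once the normal bit is set
as well, no node is skipped.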

The patch improves the score of pmbench memory accessing benchmark
with 80:20 read/write ratio and normal access address distribution by
4.6% on a 2 socket Intel server with Optane DC Persistent Memory.
The number of autonuma hint faults for the DRAM node is reduced to
almost 0 in the test.

Known problem: the autonuma statistics, such as per-node memory
accesses and the local/remote ratio, will be affected.  In particular,
the automatic adjustment of the NUMA scanning period will not work
reasonably, so we cannot rely on it.  Fortunately, there's no CPU
in the PMEM NUMA nodes, so we will not move tasks there because of
the statistics issue.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/huge_memory.c | 30 +++++++++++++++++++++---------
 mm/mprotect.c    | 14 +++++++++++++-
 2 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 885642c82aaa..61e241ce20fa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -32,6 +32,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/oom.h>
 #include <linux/numa.h>
+#include <linux/sched/sysctl.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -1937,17 +1938,28 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 #endif
 
-	/*
-	 * Avoid trapping faults against the zero page. The read-only
-	 * data is likely to be read-cached on the local CPU and
-	 * local/remote hits to the zero page are not interesting.
-	 */
-	if (prot_numa && is_huge_zero_pmd(*pmd))
-		goto unlock;
+	if (prot_numa) {
+		struct page *page;
+		/*
+		 * Avoid trapping faults against the zero page. The read-only
+		 * data is likely to be read-cached on the local CPU and
+		 * local/remote hits to the zero page are not interesting.
+		 */
+		if (is_huge_zero_pmd(*pmd))
+			goto unlock;
 
-	if (prot_numa && pmd_protnone(*pmd))
-		goto unlock;
+		if (pmd_protnone(*pmd))
+			goto unlock;
 
+		page = pmd_page(*pmd);
+		/*
+		 * Skip if normal numa balancing is disabled and no
+		 * faster memory node to promote to
+		 */
+		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
+		    next_promotion_node(page_to_nid(page)) == -1)
+			goto unlock;
+	}
 	/*
 	 * In case prot_numa, we are under down_read(mmap_sem). It's critical
 	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
diff --git a/mm/mprotect.c b/mm/mprotect.c
index d69b9913388e..0636f2e5e05b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -28,6 +28,7 @@
 #include <linux/ksm.h>
 #include <linux/uaccess.h>
 #include <linux/mm_inline.h>
+#include <linux/sched/sysctl.h>
 #include <asm/pgtable.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
@@ -79,6 +80,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			 */
 			if (prot_numa) {
 				struct page *page;
+				int nid;
 
 				/* Avoid TLB flush if possible */
 				if (pte_protnone(oldpte))
@@ -105,7 +107,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				 * Don't mess with PTEs if page is already on the node
 				 * a single-threaded process is running on.
 				 */
-				if (target_node == page_to_nid(page))
+				nid = page_to_nid(page);
+				if (target_node == nid)
+					continue;
+
+				/*
+				 * Skip scanning if normal numa
+				 * balancing is disabled and no faster
+				 * memory node to promote to
+				 */
+				if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
+				    next_promotion_node(nid) == -1)
 					continue;
 			}
 
-- 
2.23.0



* [RFC 07/10] autonuma, memory tiering: Only promote page if accessed twice
  2019-11-01  7:57 [RFC 00/10] autonuma: Optimize memory placement in memory tiering system Huang, Ying
                   ` (5 preceding siblings ...)
  2019-11-01  7:57 ` [RFC 06/10] autonuma, memory tiering: Skip to scan fastest memory Huang, Ying
@ 2019-11-01  7:57 ` Huang, Ying
  2019-11-01  7:57 ` [RFC 08/10] autonuma, memory tiering: Select hotter pages to promote to fast memory node Huang, Ying
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Huang, Ying @ 2019-11-01  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Huang Ying, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

From: Huang Ying <ying.huang@intel.com>

The original assumption of auto NUMA balancing is that the memory
privately or mainly accessed by the CPUs of a NUMA node (called
private memory) should fit the memory size of the NUMA node.  So if a
page is identified to be private memory, it will be migrated to the
target node immediately.  Eventually all private memory will be
migrated.

But this assumption isn't true in a memory tiering system.  In a
typical memory tiering system, there are CPUs, fast memory, and slow
memory in each physical NUMA node.  The CPUs and the fast memory will
be put in one logical node (called fast memory node), while the slow
memory will be put in another (faked) logical node (called slow memory
node).  To take full advantage of the system resources, it's common
that the size of the private memory of the workload is larger than the
memory size of the fast memory node.

To resolve the issue, we will try to migrate only the hot pages in the
private memory to the fast memory node.  A private page that was
accessed at least twice in the current and the last scanning periods
will be identified as a hot page and migrated.  Otherwise, the page
isn't considered hot enough to be migrated.

To record whether a page is accessed in the last scanning period, the
Accessed bit of the PTE/PMD is used.  When the page tables are scanned
for autonuma, if pte_protnone(pte) is true, the page hasn't been
accessed in the last scan period, and the Accessed bit will be
cleared; otherwise the Accessed bit will be kept.  When a NUMA hint
page fault occurs, if the Accessed bit is set, the page has been
accessed at least twice in the current and the last scanning periods
and will be migrated.
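
The scan/fault protocol above can be modelled as a two-bit state
machine.  The sketch below is a userspace illustration only: struct
sketch_pte and the sketch_* functions are hypothetical stand-ins for
the PTE and the scan and fault paths, and it assumes the Accessed bit
is set again on every access.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two PTE bits the scheme relies on. */
struct sketch_pte {
	bool protnone;	/* PROT_NONE hint installed by the scanner */
	bool young;	/* Accessed bit */
};

/* One autonuma scan pass over the PTE. */
static void sketch_scan(struct sketch_pte *pte)
{
	if (pte->protnone) {
		/* No hint fault since the last scan: the page was idle. */
		pte->young = false;
		return;
	}
	pte->protnone = true;	/* Accessed bit is kept as is */
}

/* One access to the page; returns true if it qualifies for promotion. */
static bool sketch_access(struct sketch_pte *pte)
{
	bool promote = false;

	if (pte->protnone) {		/* NUMA hint page fault */
		promote = pte->young;	/* accessed in the last period too */
		pte->protnone = false;
	}
	pte->young = true;
	return promote;
}

/* Usage: a page needs accesses in two consecutive periods to qualify. */
void sketch_accessed_demo(void)
{
	struct sketch_pte pte = { false, false };

	assert(!sketch_access(&pte));	/* plain access, no hint fault */
	sketch_scan(&pte);
	assert(sketch_access(&pte));	/* accessed before and after scan */
	sketch_scan(&pte);
	sketch_scan(&pte);		/* idle period clears Accessed */
	assert(!sketch_access(&pte));	/* first access after idling */
}
```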

The Accessed bit of PTE/PMD is used by page reclaiming too.  So a
conflict is possible.  Consider the following situation:

a) the page is moved from active list to inactive list with Accessed
   bit cleared

b) the page is accessed, so Accessed bit is set

c) the page table is scanned by autonuma, PTE is set to
   PROTNONE+Accessed

d) the page isn't accessed

e) the page table is scanned by autonuma again, Accessed is cleared

f) the inactive list is scanned for reclaiming, the page is reclaimed
   wrongly because Accessed bit is cleared by autonuma

Although the page is reclaimed wrongly, it hasn't been accessed for
at least one numa balancing scanning period.  So the page isn't very
hot either.  That is, this shouldn't be a severe issue.

The patch improves the score of pmbench memory accessing benchmark
with 80:20 read/write ratio and normal access address distribution by
3.1% on a 2 socket Intel server with Optane DC Persistent Memory.  In
the test, the number of pages promoted by autonuma is reduced by 7.2%
because those pages fail the accessed-twice check.

Problems:

 - how to adjust the scanning period to meet the hot page
   identification requirement.  E.g. if the count of page promotion is
   much larger than free memory, we need to scan faster to identify
   really hot pages.  But this will trigger too many page faults too.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/huge_memory.c | 13 ++++++++++++-
 mm/memory.c      | 28 +++++++++++++++-------------
 mm/mprotect.c    | 11 ++++++++++-
 3 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 61e241ce20fa..7634fb22931b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1528,6 +1528,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	if (unlikely(!pmd_same(pmd, *vmf->pmd)))
 		goto out_unlock;
 
+	/* Only migrate if accessed twice */
+	if (!pmd_young(*vmf->pmd))
+		goto out_unlock;
+
 	/*
 	 * If there are potential migrations, wait for completion and retry
 	 * without disrupting NUMA hinting information. Do not relock and
@@ -1948,8 +1952,15 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		if (is_huge_zero_pmd(*pmd))
 			goto unlock;
 
-		if (pmd_protnone(*pmd))
+		if (pmd_protnone(*pmd)) {
+			/*
+			 * PMD young bit is used to record whether the
+			 * page is accessed in last scan period
+			 */
+			if (pmd_young(*pmd))
+				set_pmd_at(mm, addr, pmd, pmd_mkold(*pmd));
 			goto unlock;
+		}
 
 		page = pmd_page(*pmd);
 		/*
diff --git a/mm/memory.c b/mm/memory.c
index ddf20bd0c317..e5da50eca36f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3711,10 +3711,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	 */
 	vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
 	spin_lock(vmf->ptl);
-	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		goto out;
-	}
+	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
+		goto unmap_out;
 
 	/*
 	 * Make it present again, Depending on how arch implementes non
@@ -3728,17 +3726,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 
+	/* Only migrate if accessed twice */
+	if (!pte_young(old_pte))
+		goto unmap_out;
+
 	page = vm_normal_page(vma, vmf->address, pte);
-	if (!page) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	if (!page)
+		goto unmap_out;
 
 	/* TODO: handle PTE-mapped THP */
-	if (PageCompound(page)) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	if (PageCompound(page))
+		goto unmap_out;
 
 	/*
 	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
@@ -3776,10 +3774,14 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	} else
 		flags |= TNF_MIGRATE_FAIL;
 
-out:
 	if (page_nid != NUMA_NO_NODE)
 		task_numa_fault(last_cpupid, page_nid, 1, flags);
 	return 0;
+
+unmap_out:
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
+	return 0;
 }
 
 static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0636f2e5e05b..fd8d8e813717 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -83,8 +83,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				int nid;
 
 				/* Avoid TLB flush if possible */
-				if (pte_protnone(oldpte))
+				if (pte_protnone(oldpte)) {
+					/*
+					 * PTE young bit is used to record
+					 * whether the page is accessed in
+					 * last scan period
+					 */
+					if (pte_young(oldpte))
+						set_pte_at(vma->vm_mm, addr, pte,
+							   pte_mkold(oldpte));
 					continue;
+				}
 
 				page = vm_normal_page(vma, addr, oldpte);
 				if (!page || PageKsm(page))
-- 
2.23.0



* [RFC 08/10] autonuma, memory tiering: Select hotter pages to promote to fast memory node
  2019-11-01  7:57 [RFC 00/10] autonuma: Optimize memory placement in memory tiering system Huang, Ying
                   ` (6 preceding siblings ...)
  2019-11-01  7:57 ` [RFC 07/10] autonuma, memory tiering: Only promote page if accessed twice Huang, Ying
@ 2019-11-01  7:57 ` Huang, Ying
  2019-11-01  9:24   ` Peter Zijlstra
  2019-11-01  7:57 ` [RFC 09/10] autonuma, memory tiering: Double hot threshold for write hint page fault Huang, Ying
  2019-11-01  7:57 ` [RFC 10/10] autonuma, memory tiering: Adjust hot threshold automatically Huang, Ying
  9 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2019-11-01  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Huang Ying, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

From: Huang Ying <ying.huang@intel.com>

In a memory tiering system, to maximize the overall system
performance, the hot pages should be put in the fast memory node while
the cold pages should be put in the slow memory node.  In the original
memory tiering autonuma implementation, we try to promote almost all
recently accessed pages, and use the LRU algorithm in page reclaiming
to keep the hot pages in the fast memory node and demote the cold
pages to the slow memory node.  The problem with this solution is that
cold pages with a low access frequency may be promoted and then
demoted again, which wastes memory bandwidth.  And because migration
is rate-limited, the hot pages need to compete with the cold pages for
the limited migration bandwidth.

If we could select the hotter pages to promote to the fast memory node
in the first place, then the wasted migration bandwidth would be
reduced and the hot pages would be promoted more quickly.

The patch "autonuma, memory tiering: Only promote page if accessed
twice" in the series will prevent the really cold pages that are not
accessed in the last scan period from being promoted.  But the scan
period could be as long as tens of seconds, so it doesn't work well
enough for selecting the hotter pages.

To identify the hotter pages, in this patch we implement a method
based on autonuma page table scanning and hint page faults as follows:

- When a range of the page table is scanned in autonuma, the timestamp
  and the address range are recorded in a ring buffer in struct
  mm_struct.  So we have information about the most recent N scans.

- When the autonuma hint page fault occurs, the fault address is
  searched in the ring buffer to get its scanning timestamp.  The hint
  page fault latency is defined as

    hint page fault timestamp - scan timestamp

  The higher the access frequency of a page, the more likely its hint
  page fault latency is to be short.  So the hint page fault latency
  is a good estimate of the page heat.
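
The lookup described in the two steps above can be sketched in
userspace as below.  It is a simplified model of the idea, not the
patch itself: it omits the memory barriers, READ_ONCE()/WRITE_ONCE()
and the "start == end" special case, uses a 4-entry history instead of
16, and SKETCH_ADDR_MAX is a hypothetical stand-in for TASK_SIZE.

```c
#include <assert.h>

#define SKETCH_NR_HIST	4
#define SKETCH_ADDR_MAX	0x10000UL	/* hypothetical top of address space */

struct sketch_scan_hist {
	int idx;				/* next slot to write */
	unsigned long offset;			/* end of the newest scanned range */
	unsigned long starts[SKETCH_NR_HIST];	/* start address of each scan */
	unsigned long stamps[SKETCH_NR_HIST];	/* time of each scan */
};

/* Record that [start, end) was scanned at time now. */
static void sketch_record_scan(struct sketch_scan_hist *h, unsigned long start,
			       unsigned long end, unsigned long now)
{
	h->starts[h->idx] = start;
	h->stamps[h->idx] = now;
	h->idx = (h->idx + 1) % SKETCH_NR_HIST;
	h->offset = end;
}

/* Walk the history from newest to oldest to find the scan covering addr. */
static unsigned long sketch_fault_latency(const struct sketch_scan_hist *h,
					  unsigned long addr, unsigned long now)
{
	int i = h->idx ? h->idx - 1 : SKETCH_NR_HIST - 1;
	unsigned long end = h->offset;
	unsigned long latency = 0;
	int j;

	for (j = 0; j < SKETCH_NR_HIST; j++) {
		unsigned long start = h->starts[i];

		latency = now - h->stamps[i];
		if (end < start)	/* the scan wrapped past the top */
			end = SKETCH_ADDR_MAX;
		if (addr >= start && addr < end)
			return latency;
		end = start;	/* an older scan ends where this one began */
		i = i ? i - 1 : SKETCH_NR_HIST - 1;
	}
	/* Not covered by the history: use the oldest entry's latency. */
	return latency;
}

/* Usage: three consecutive scans, then faults at time 350. */
void sketch_latency_demo(void)
{
	struct sketch_scan_hist h = { 0 };

	sketch_record_scan(&h, 0x1000, 0x2000, 100);
	sketch_record_scan(&h, 0x2000, 0x3000, 200);
	sketch_record_scan(&h, 0x3000, 0x4000, 300);
	assert(sketch_fault_latency(&h, 0x3800, 350) == 50);
	assert(sketch_fault_latency(&h, 0x2800, 350) == 150);
	assert(sketch_fault_latency(&h, 0x1800, 350) == 250);
}
```

Only the start address of each scan is stored.  The end of a range is
inferred from the start of the next newer scan, which keeps the
history to two words per entry.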

The remaining problem is how to determine the hot threshold.  This is
not easy to do automatically.  So we provide a sysctl knob:
kernel.numa_balancing_hot_threshold_ms.  All pages with hint page
fault latency < the threshold will be considered hot.  The system
administrator can determine the hot threshold via various information,
such as PMEM bandwidth limit, the average number of pages that pass
the hot threshold, etc.  The default hot threshold is 1 second, which
works well in our performance test.

The patch improves the score of pmbench memory accessing benchmark
with 80:20 read/write ratio and normal access address distribution by
9.2% with 50.3% fewer NUMA page migrations on a 2 socket Intel server
with Optane DC Persistent Memory.  That is, the cost of autonuma page
migration is reduced considerably.

The downside of the patch is that the response time to changes of the
workload hot spot may be much longer.  For example,

- A previously cold memory area becomes hot

- The hint page fault will be triggered.  But the hint page fault
  latency may not be shorter than the hot threshold.  So the pages may
  not be promoted.

- When the memory area is scanned again, maybe after a scan period,
  the hint page fault latency measured will be shorter than the hot
  threshold and the pages will be promoted.

To mitigate this,

- If there is enough free space in the fast memory node (> high
  watermark + 2 * promotion rate limit), the hot threshold will not be
  used, and all pages will be promoted upon the hint page fault for
  fast response.

- If fast response is more important for system performance, the
  administrator can set a higher hot threshold.
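
Putting the threshold and the first mitigation together, the promotion
decision can be sketched as below.  This is an illustrative userspace
model with hypothetical sketch_* names.  Latencies are kept in
milliseconds instead of jiffies, and the final rate-limit check is
left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Mitigation: plenty of free space in the fast node bypasses the threshold. */
static bool sketch_free_space_enough(unsigned long free,
				     unsigned long high_wmark,
				     unsigned long ratelimit_pages)
{
	return free > high_wmark + 2 * ratelimit_pages;
}

/* Decide whether a page that took a hint fault should be promoted. */
static bool sketch_should_promote(bool free_space_enough, bool young,
				  unsigned long latency_ms,
				  unsigned long threshold_ms)
{
	if (free_space_enough)
		return true;	/* fast response, threshold not used */
	if (!young)
		return false;	/* not accessed in the last scan period */
	if (latency_ms > threshold_ms)
		return false;	/* not hot enough */
	return true;		/* the patch then applies the rate limit */
}

/* Usage: default threshold of 1000 ms, fast node under pressure. */
void sketch_promote_demo(void)
{
	assert(sketch_should_promote(false, true, 200, 1000));
	assert(!sketch_should_promote(false, true, 1500, 1000));
	assert(!sketch_should_promote(false, false, 10, 1000));
	assert(sketch_should_promote(true, false, 5000, 1000));
}
```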

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mempolicy.h            |  5 +-
 include/linux/mm_types.h             |  5 ++
 include/linux/sched/numa_balancing.h |  8 ++-
 include/linux/sched/sysctl.h         |  1 +
 kernel/sched/fair.c                  | 83 +++++++++++++++++++++++++---
 kernel/sysctl.c                      |  7 +++
 mm/huge_memory.c                     |  6 +-
 mm/memory.c                          |  7 +--
 mm/mempolicy.c                       |  7 ++-
 9 files changed, 109 insertions(+), 20 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..674aaa7614ed 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -202,7 +202,8 @@ static inline bool vma_migratable(struct vm_area_struct *vma)
 	return true;
 }
 
-extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long,
+			  int flags);
 extern void mpol_put_task_policy(struct task_struct *);
 
 #else
@@ -300,7 +301,7 @@ static inline int mpol_parse_str(char *str, struct mempolicy **mpol)
 #endif
 
 static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
-				 unsigned long address)
+				 unsigned long address, int flags)
 {
 	return -1; /* no node preference */
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8ec38b11b361..59e2151734ab 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -484,6 +484,11 @@ struct mm_struct {
 
 		/* numa_scan_seq prevents two threads setting pte_numa */
 		int numa_scan_seq;
+
+#define NUMA_SCAN_NR_HIST	16
+		int numa_scan_idx;
+		unsigned long numa_scan_jiffies[NUMA_SCAN_NR_HIST];
+		unsigned long numa_scan_starts[NUMA_SCAN_NR_HIST];
 #endif
 		/*
 		 * An operation with batched TLB flushing is going on. Anything
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index e7dd04a84ba8..e1c2728d5bb2 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -14,6 +14,7 @@
 #define TNF_SHARED	0x04
 #define TNF_FAULT_LOCAL	0x08
 #define TNF_MIGRATE_FAIL 0x10
+#define TNF_YOUNG	0x20
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
@@ -21,7 +22,8 @@ extern pid_t task_numa_group_id(struct task_struct *p);
 extern void set_numabalancing_state(bool enabled);
 extern void task_numa_free(struct task_struct *p);
 extern bool should_numa_migrate_memory(struct task_struct *p, struct page *page,
-					int src_nid, int dst_cpu);
+				       int src_nid, int dst_cpu,
+				       unsigned long addr, int flags);
 #else
 static inline void task_numa_fault(int last_node, int node, int pages,
 				   int flags)
@@ -38,7 +40,9 @@ static inline void task_numa_free(struct task_struct *p)
 {
 }
 static inline bool should_numa_migrate_memory(struct task_struct *p,
-				struct page *page, int src_nid, int dst_cpu)
+					      struct page *page, int src_nid,
+					      int dst_cpu, unsigned long addr,
+					      int flags)
 {
 	return true;
 }
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index e3616889a91c..5fc444024ec6 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -43,6 +43,7 @@ extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_rate_limit;
+extern unsigned int sysctl_numa_balancing_hot_threshold;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 489e2e21bb5d..d6cf5832556e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1053,6 +1053,9 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
  */
 unsigned int sysctl_numa_balancing_rate_limit;
 
+/* The page with hint page fault latency < threshold in ms is considered hot */
+unsigned int sysctl_numa_balancing_hot_threshold = 1000;
+
 struct numa_group {
 	refcount_t refcount;
 
@@ -1158,7 +1161,7 @@ static unsigned int task_scan_max(struct task_struct *p)
 
 void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 {
-	int mm_users = 0;
+	int mm_users = 0, i;
 	struct mm_struct *mm = p->mm;
 
 	if (mm) {
@@ -1166,6 +1169,11 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 		if (mm_users == 1) {
 			mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
 			mm->numa_scan_seq = 0;
+			mm->numa_scan_idx = 0;
+			for (i = 0; i < NUMA_SCAN_NR_HIST; i++) {
+				mm->numa_scan_jiffies[i] = 0;
+				mm->numa_scan_starts[i] = 0;
+			}
 		}
 	}
 	p->node_stamp			= 0;
@@ -1423,6 +1431,43 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat)
 	return false;
 }
 
+static long numa_hint_fault_latency(struct task_struct *p, unsigned long addr)
+{
+	struct mm_struct *mm = p->mm;
+	unsigned long now = jiffies;
+	unsigned long start, end;
+	int i, j;
+	long latency = 0;
+
+	i = READ_ONCE(mm->numa_scan_idx);
+	i = i ? i - 1 : NUMA_SCAN_NR_HIST - 1;
+	/*
+	 * Paired with smp_wmb() in task_numa_work() to check
+	 * scan range buffer after get current index
+	 */
+	smp_rmb();
+	end = READ_ONCE(mm->numa_scan_offset);
+	start = READ_ONCE(mm->numa_scan_starts[i]);
+	if (start == end)
+		end = start + MAX_SCAN_WINDOW * (1UL << 22);
+	for (j = 0; j < NUMA_SCAN_NR_HIST; j++) {
+		latency = now - READ_ONCE(mm->numa_scan_jiffies[i]);
+		start = READ_ONCE(mm->numa_scan_starts[i]);
+		/* Scan past the end of address space */
+		if (end < start)
+			end = TASK_SIZE;
+		if (addr >= start && addr < end)
+			return latency;
+		end = start;
+		i = i ? i - 1 : NUMA_SCAN_NR_HIST - 1;
+	}
+	/*
+	 * The tracking window isn't large enough, approximate to the
+	 * max latency in the tracking window.
+	 */
+	return latency;
+}
+
 static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
 					    unsigned long rate_limit, int nr)
 {
@@ -1442,7 +1487,8 @@ static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
 }
 
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
-				int src_nid, int dst_cpu)
+				int src_nid, int dst_cpu, unsigned long addr,
+				int flags)
 {
 	struct numa_group *ng = p->numa_group;
 	int dst_nid = cpu_to_node(dst_cpu);
@@ -1455,12 +1501,22 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
 	    next_promotion_node(src_nid) != -1) {
 		struct pglist_data *pgdat;
-		unsigned long rate_limit;
+		unsigned long rate_limit, latency, threshold;
 
 		pgdat = NODE_DATA(dst_nid);
 		if (pgdat_free_space_enough(pgdat))
 			return true;
 
+		/* The page hasn't been accessed in the last scan period */
+		if (!(flags & TNF_YOUNG))
+			return false;
+
+		threshold = msecs_to_jiffies(
+			sysctl_numa_balancing_hot_threshold);
+		latency = numa_hint_fault_latency(p, addr);
+		if (latency > threshold)
+			return false;
+
 		rate_limit = sysctl_numa_balancing_rate_limit <<
 			(20 - PAGE_SHIFT);
 		return numa_migration_check_rate_limit(pgdat, rate_limit,
@@ -2508,7 +2564,7 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	 * expensive, to avoid any form of compiler optimizations:
 	 */
 	WRITE_ONCE(p->mm->numa_scan_seq, READ_ONCE(p->mm->numa_scan_seq) + 1);
-	p->mm->numa_scan_offset = 0;
+	WRITE_ONCE(p->mm->numa_scan_offset, 0);
 }
 
 /*
@@ -2525,6 +2581,7 @@ void task_numa_work(struct callback_head *work)
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
 	long pages, virtpages;
+	int idx;
 
 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
 
@@ -2583,6 +2640,19 @@ void task_numa_work(struct callback_head *work)
 		start = 0;
 		vma = mm->mmap;
 	}
+	idx = mm->numa_scan_idx;
+	WRITE_ONCE(mm->numa_scan_starts[idx], start);
+	WRITE_ONCE(mm->numa_scan_jiffies[idx], jiffies);
+	/*
+	 * Paired with smp_rmb() in numa_hint_fault_latency() to
+	 * update scan range buffer index after updating the buffer
+	 * contents.
+	 */
+	smp_wmb();
+	if (idx + 1 >= NUMA_SCAN_NR_HIST)
+		WRITE_ONCE(mm->numa_scan_idx, 0);
+	else
+		WRITE_ONCE(mm->numa_scan_idx, idx + 1);
 	for (; vma; vma = vma->vm_next) {
 		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
 			is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
@@ -2610,6 +2680,7 @@ void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
+			WRITE_ONCE(mm->numa_scan_offset, end);
 			nr_pte_updates = change_prot_numa(vma, start, end);
 
 			/*
@@ -2639,9 +2710,7 @@ void task_numa_work(struct callback_head *work)
 	 * would find the !migratable VMA on the next scan but not reset the
 	 * scanner to the start so check it now.
 	 */
-	if (vma)
-		mm->numa_scan_offset = start;
-	else
+	if (!vma)
 		reset_ptenuma_scan(p);
 	up_read(&mm->mmap_sem);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c455ff404436..b7c2e15d322d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -429,6 +429,13 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "numa_balancing_hot_threshold_ms",
+		.data		= &sysctl_numa_balancing_hot_threshold,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{
 		.procname	= "numa_balancing",
 		.data		= &sysctl_numa_balancing_mode,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7634fb22931b..9177cc2febd4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1529,8 +1529,8 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 		goto out_unlock;
 
 	/* Only migrate if accessed twice */
-	if (!pmd_young(*vmf->pmd))
-		goto out_unlock;
+	if (pmd_young(*vmf->pmd))
+		flags |= TNF_YOUNG;
 
 	/*
 	 * If there are potential migrations, wait for completion and retry
@@ -1565,7 +1565,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	 * page_table_lock if at all possible
 	 */
 	page_locked = trylock_page(page);
-	target_nid = mpol_misplaced(page, vma, haddr);
+	target_nid = mpol_misplaced(page, vma, haddr, flags);
 	if (target_nid == NUMA_NO_NODE) {
 		/* If the page was locked, there are no parallel migrations */
 		if (page_locked)
diff --git a/mm/memory.c b/mm/memory.c
index e5da50eca36f..80902ff7f5de 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3689,7 +3689,7 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		*flags |= TNF_FAULT_LOCAL;
 	}
 
-	return mpol_misplaced(page, vma, addr);
+	return mpol_misplaced(page, vma, addr, *flags);
 }
 
 static vm_fault_t do_numa_page(struct vm_fault *vmf)
@@ -3726,9 +3726,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 
-	/* Only migrate if accessed twice */
-	if (!pte_young(old_pte))
-		goto unmap_out;
+	if (pte_young(old_pte))
+		flags |= TNF_YOUNG;
 
 	page = vm_normal_page(vma, vmf->address, pte);
 	if (!page)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5a13bc52172f..28f803fabf5d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2315,6 +2315,7 @@ static void sp_free(struct sp_node *n)
  * @page: page to be checked
  * @vma: vm area where page mapped
  * @addr: virtual address where page mapped
+ * @flags: numa balancing flags
  *
  * Lookup current policy node id for vma,addr and "compare to" page's
  * node id.
@@ -2326,7 +2327,8 @@ static void sp_free(struct sp_node *n)
  * Policy determination "mimics" alloc_page_vma().
  * Called from fault path where we know the vma and faulting address.
  */
-int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+		   unsigned long addr, int flags)
 {
 	struct mempolicy *pol;
 	struct zoneref *z;
@@ -2380,7 +2382,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	if (pol->flags & MPOL_F_MORON) {
 		polnid = thisnid;
 
-		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
+		if (!should_numa_migrate_memory(current, page, curnid,
+						thiscpu, addr, flags))
 			goto out;
 	}
 
-- 
2.23.0



* Re: [RFC 08/10] autonuma, memory tiering: Select hotter pages to promote to fast memory node
  2019-11-01  7:57 ` [RFC 08/10] autonuma, memory tiering: Select hotter pages to promote to fast memory node Huang, Ying
@ 2019-11-01  9:24   ` Peter Zijlstra
  2019-11-04  2:41     ` Huang, Ying
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-11-01  9:24 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

On Fri, Nov 01, 2019 at 03:57:25PM +0800, Huang, Ying wrote:
> index 8ec38b11b361..59e2151734ab 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -484,6 +484,11 @@ struct mm_struct {
>  
>  		/* numa_scan_seq prevents two threads setting pte_numa */
>  		int numa_scan_seq;
> +
> +#define NUMA_SCAN_NR_HIST	16
> +		int numa_scan_idx;
> +		unsigned long numa_scan_jiffies[NUMA_SCAN_NR_HIST];
> +		unsigned long numa_scan_starts[NUMA_SCAN_NR_HIST];

Why 16? This is 4 cachelines.

>  #endif
>  		/*
>  		 * An operation with batched TLB flushing is going on. Anything

> +static long numa_hint_fault_latency(struct task_struct *p, unsigned long addr)
> +{
> +	struct mm_struct *mm = p->mm;
> +	unsigned long now = jiffies;
> +	unsigned long start, end;
> +	int i, j;
> +	long latency = 0;
> +
> +	i = READ_ONCE(mm->numa_scan_idx);
> +	i = i ? i - 1 : NUMA_SCAN_NR_HIST - 1;
> +	/*
> +	 * Paired with smp_wmb() in task_numa_work() to check
> +	 * scan range buffer after get current index
> +	 */
> +	smp_rmb();

That wants to be:

	i = smp_load_acquire(&mm->numa_scan_idx);
	i = (i - 1) % NUMA_SCAN_NR_HIST;

(and because NUMA_SCAN_NR_HIST is a power of 2, the compiler will
conveniently make that a bitwise and operation)

And: "DEC %0; AND $15, %0" is so much faster than a branch.
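
A minimal sketch of the wrap-around arithmetic suggested above.  It
assumes the index is held in an unsigned type, which the patch does not
do (numa_scan_idx is declared int): with a signed int, (0 - 1) % 16
evaluates to -1 in C rather than 15, and the compiler cannot reduce the
signed form to a plain AND.  The helper names hist_prev(), hist_next()
and hist_prev_branchy() are hypothetical.

```c
#define NUMA_SCAN_NR_HIST 16u	/* must stay a power of two */

/* Step one slot backwards in the ring; slot 0 wraps to the last slot. */
static unsigned int hist_prev(unsigned int i)
{
	/* Unsigned wrap-around makes (0 - 1) a huge value whose low bits are all set. */
	return (i - 1u) % NUMA_SCAN_NR_HIST;
}

/* Step one slot forwards in the ring; the last slot wraps to 0. */
static unsigned int hist_next(unsigned int i)
{
	return (i + 1u) % NUMA_SCAN_NR_HIST;
}

/* The branchy form from the patch, kept for comparison. */
static int hist_prev_branchy(int i)
{
	return i ? i - 1 : (int)NUMA_SCAN_NR_HIST - 1;
}
```

For every valid slot the unsigned form and the branchy form agree, so
the change is behavior-preserving only once the index type is unsigned.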

> +	end = READ_ONCE(mm->numa_scan_offset);
> +	start = READ_ONCE(mm->numa_scan_starts[i]);
> +	if (start == end)
> +		end = start + MAX_SCAN_WINDOW * (1UL << 22);
> +	for (j = 0; j < NUMA_SCAN_NR_HIST; j++) {
> +		latency = now - READ_ONCE(mm->numa_scan_jiffies[i]);
> +		start = READ_ONCE(mm->numa_scan_starts[i]);
> +		/* Scan pass the end of address space */
> +		if (end < start)
> +			end = TASK_SIZE;
> +		if (addr >= start && addr < end)
> +			return latency;
> +		end = start;
> +		i = i ? i - 1 : NUMA_SCAN_NR_HIST - 1;

		i = (i - 1) % NUMA_SCAN_NR_HIST;
> +	}
> +	/*
> +	 * The tracking window isn't large enough, approximate to the
> +	 * max latency in the tracking window.
> +	 */
> +	return latency;
> +}

> @@ -2583,6 +2640,19 @@ void task_numa_work(struct callback_head *work)
>  		start = 0;
>  		vma = mm->mmap;
>  	}
> +	idx = mm->numa_scan_idx;
> +	WRITE_ONCE(mm->numa_scan_starts[idx], start);
> +	WRITE_ONCE(mm->numa_scan_jiffies[idx], jiffies);
> +	/*
> +	 * Paired with smp_rmb() in should_numa_migrate_memory() to
> +	 * update scan range buffer index after update the buffer
> +	 * contents.
> +	 */
> +	smp_wmb();
> +	if (idx + 1 >= NUMA_SCAN_NR_HIST)
> +		WRITE_ONCE(mm->numa_scan_idx, 0);
> +	else
> +		WRITE_ONCE(mm->numa_scan_idx, idx + 1);

	smp_store_release(&mm->numa_scan_idx, (idx + 1) % NUMA_SCAN_NR_HIST);
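
The release/acquire pairing discussed in this review can be sketched in
userspace with C11 atomics.  This is an illustration under stated
assumptions, not the kernel code: struct scan_hist, hist_record() and
hist_latest_start() are hypothetical names, there is a single writer,
and memory_order_release / memory_order_acquire stand in for
smp_store_release() / smp_load_acquire().  The kernel patch also wraps
the slot accesses in WRITE_ONCE()/READ_ONCE(), which this sketch omits.

```c
#include <stdatomic.h>

#define NR_HIST 16u	/* power of two, as in the patch */

struct scan_hist {
	atomic_uint idx;		/* next slot the writer will fill */
	unsigned long starts[NR_HIST];	/* scan start address per slot */
	unsigned long stamps[NR_HIST];	/* timestamp per slot */
};

/* Writer: fill the slot first, then publish the advanced index. */
static void hist_record(struct scan_hist *h, unsigned long start,
			unsigned long stamp)
{
	unsigned int idx = atomic_load_explicit(&h->idx, memory_order_relaxed);

	h->starts[idx] = start;
	h->stamps[idx] = stamp;
	/* Release: the slot contents become visible before the new index. */
	atomic_store_explicit(&h->idx, (idx + 1u) % NR_HIST,
			      memory_order_release);
}

/* Reader: acquire the index, then read the most recently published slot. */
static unsigned long hist_latest_start(struct scan_hist *h)
{
	unsigned int idx = atomic_load_explicit(&h->idx, memory_order_acquire);

	return h->starts[(idx - 1u) % NR_HIST];
}
```

A reader that observes the new index is guaranteed to also observe the
slot written before it, which is the ordering the smp_wmb()/smp_rmb()
pair in the patch was meant to provide.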



* Re: [RFC 08/10] autonuma, memory tiering: Select hotter pages to promote to fast memory node
  2019-11-01  9:24   ` Peter Zijlstra
@ 2019-11-04  2:41     ` Huang, Ying
  2019-11-04  8:44       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2019-11-04  2:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

Hi, Peter,

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Nov 01, 2019 at 03:57:25PM +0800, Huang, Ying wrote:
>> index 8ec38b11b361..59e2151734ab 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -484,6 +484,11 @@ struct mm_struct {
>>  
>>  		/* numa_scan_seq prevents two threads setting pte_numa */
>>  		int numa_scan_seq;
>> +
>> +#define NUMA_SCAN_NR_HIST	16
>> +		int numa_scan_idx;
>> +		unsigned long numa_scan_jiffies[NUMA_SCAN_NR_HIST];
>> +		unsigned long numa_scan_starts[NUMA_SCAN_NR_HIST];
>
> Why 16? This is 4 cachelines.

We want to keep the NUMA scanning history reasonably long.  From
task_scan_min(), the minimal interval between task_numa_work() runs
is about 100 ms by default.  So we can keep 1600 ms of history by
default if NUMA_SCAN_NR_HIST is 16.  If the user chooses a smaller
sysctl_numa_balancing_scan_size, then we can only keep a shorter
history.  In general, we want to keep no less than 1000 ms of history.
So 16 appears to be a reasonable choice for us.  Any other suggestions?
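
The arithmetic behind this choice can be written out as a small sketch.
The helper names are hypothetical, and the power-of-two rounding step
is an inference from the masking suggestion earlier in the thread, not
something stated in this reply.

```c
#define NUMA_SCAN_NR_HIST 16u

/* Approximate time span, in ms, covered by the whole history buffer. */
static unsigned int hist_span_ms(unsigned int scan_interval_ms)
{
	return NUMA_SCAN_NR_HIST * scan_interval_ms;
}

/* Smallest number of slots that covers at least want_ms of history. */
static unsigned int hist_slots_needed(unsigned int want_ms,
				      unsigned int scan_interval_ms)
{
	return (want_ms + scan_interval_ms - 1u) / scan_interval_ms;
}

/* Round up to the next power of two so the index can be masked. */
static unsigned int round_up_pow2(unsigned int n)
{
	unsigned int p = 1u;

	while (p < n)
		p <<= 1;
	return p;
}
```

With the default 100 ms interval, 1000 ms of history needs 10 slots,
and rounding 10 up to a power of two gives 16 slots, or 1600 ms.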

>>  #endif
>>  		/*
>>  		 * An operation with batched TLB flushing is going on. Anything
>
>> +static long numa_hint_fault_latency(struct task_struct *p, unsigned long addr)
>> +{
>> +	struct mm_struct *mm = p->mm;
>> +	unsigned long now = jiffies;
>> +	unsigned long start, end;
>> +	int i, j;
>> +	long latency = 0;
>> +
>> +	i = READ_ONCE(mm->numa_scan_idx);
>> +	i = i ? i - 1 : NUMA_SCAN_NR_HIST - 1;
>> +	/*
>> +	 * Paired with smp_wmb() in task_numa_work() to check
>> +	 * scan range buffer after get current index
>> +	 */
>> +	smp_rmb();
>
> That wants to be:
>
> 	i = smp_load_acquire(&mm->numa_scan_idx)
> 	i = (i - 1) % NUMA_SCAN_NR_HIST;
>
> (and because NUMA_SCAN_NR_HIST is a power of 2, the compiler will
> conveniently make that a bitwise and operation)
>
> And: "DEC %0; AND $15, %0" is so much faster than a branch.

This looks much better.  Thanks!  I will use it in the next version.

>> +	end = READ_ONCE(mm->numa_scan_offset);
>> +	start = READ_ONCE(mm->numa_scan_starts[i]);
>> +	if (start == end)
>> +		end = start + MAX_SCAN_WINDOW * (1UL << 22);
>> +	for (j = 0; j < NUMA_SCAN_NR_HIST; j++) {
>> +		latency = now - READ_ONCE(mm->numa_scan_jiffies[i]);
>> +		start = READ_ONCE(mm->numa_scan_starts[i]);
>> +		/* Scan pass the end of address space */
>> +		if (end < start)
>> +			end = TASK_SIZE;
>> +		if (addr >= start && addr < end)
>> +			return latency;
>> +		end = start;
>> +		i = i ? i - 1 : NUMA_SCAN_NR_HIST - 1;
>
> 		i = (i - 1) % NUMA_SCAN_NR_HIST;

Will use this in the next version.

>> +	}
>> +	/*
>> +	 * The tracking window isn't large enough, approximate to the
>> +	 * max latency in the tracking window.
>> +	 */
>> +	return latency;
>> +}
>
>> @@ -2583,6 +2640,19 @@ void task_numa_work(struct callback_head *work)
>>  		start = 0;
>>  		vma = mm->mmap;
>>  	}
>> +	idx = mm->numa_scan_idx;
>> +	WRITE_ONCE(mm->numa_scan_starts[idx], start);
>> +	WRITE_ONCE(mm->numa_scan_jiffies[idx], jiffies);
>> +	/*
>> +	 * Paired with smp_rmb() in should_numa_migrate_memory() to
>> +	 * update scan range buffer index after update the buffer
>> +	 * contents.
>> +	 */
>> +	smp_wmb();
>> +	if (idx + 1 >= NUMA_SCAN_NR_HIST)
>> +		WRITE_ONCE(mm->numa_scan_idx, 0);
>> +	else
>> +		WRITE_ONCE(mm->numa_scan_idx, idx + 1);
>
> 	smp_store_release(&mm->nums_scan_idx, idx % NUMA_SCAN_NR_HIST);

Will use this in the next version.

Best Regards,
Huang, Ying


* Re: [RFC 08/10] autonuma, memory tiering: Select hotter pages to promote to fast memory node
  2019-11-04  2:41     ` Huang, Ying
@ 2019-11-04  8:44       ` Peter Zijlstra
  2019-11-04 10:13         ` Huang, Ying
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-11-04  8:44 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

On Mon, Nov 04, 2019 at 10:41:10AM +0800, Huang, Ying wrote:

> >> +#define NUMA_SCAN_NR_HIST	16
> >> +		int numa_scan_idx;
> >> +		unsigned long numa_scan_jiffies[NUMA_SCAN_NR_HIST];
> >> +		unsigned long numa_scan_starts[NUMA_SCAN_NR_HIST];
> >
> > Why 16? This is 4 cachelines.
> 
> We want to keep the NUMA scanning history reasonably long.  From
> task_scan_min(), the minimal interval between task_numa_work() running
> is about 100 ms by default.  So we can keep 1600 ms history by default
> if NUMA_SCAN_NR_HIST is 16.  If user choose to use smaller
> sysctl_numa_balancing_scan_size, then we can only keep shorter history.
> In general, we want to keep no less than 1000 ms history.  So 16 appears
> like a reasonable choice for us.  Any other suggestion?

This is very good information for Changelogs and comments :-)


* Re: [RFC 08/10] autonuma, memory tiering: Select hotter pages to promote to fast memory node
  2019-11-04  8:44       ` Peter Zijlstra
@ 2019-11-04 10:13         ` Huang, Ying
  0 siblings, 0 replies; 22+ messages in thread
From: Huang, Ying @ 2019-11-04 10:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Nov 04, 2019 at 10:41:10AM +0800, Huang, Ying wrote:
>
>> >> +#define NUMA_SCAN_NR_HIST	16
>> >> +		int numa_scan_idx;
>> >> +		unsigned long numa_scan_jiffies[NUMA_SCAN_NR_HIST];
>> >> +		unsigned long numa_scan_starts[NUMA_SCAN_NR_HIST];
>> >
>> > Why 16? This is 4 cachelines.
>> 
>> We want to keep the NUMA scanning history reasonably long.  From
>> task_scan_min(), the minimal interval between task_numa_work() running
>> is about 100 ms by default.  So we can keep 1600 ms history by default
>> if NUMA_SCAN_NR_HIST is 16.  If user choose to use smaller
>> sysctl_numa_balancing_scan_size, then we can only keep shorter history.
>> In general, we want to keep no less than 1000 ms history.  So 16 appears
>> like a reasonable choice for us.  Any other suggestion?
>
> This is very good information for Changelogs and comments :-)

Thanks!  Will do this in the next version.

Best Regards,
Huang, Ying


* [RFC 09/10] autonuma, memory tiering: Double hot threshold for write hint page fault
  2019-11-01  7:57 [RFC 00/10] autonuma: Optimize memory placement in memory tiering system Huang, Ying
                   ` (7 preceding siblings ...)
  2019-11-01  7:57 ` [RFC 08/10] autonuma, memory tiering: Select hotter pages to promote to fast memory node Huang, Ying
@ 2019-11-01  7:57 ` Huang, Ying
  2019-11-01  7:57 ` [RFC 10/10] autonuma, memory tiering: Adjust hot threshold automatically Huang, Ying
  9 siblings, 0 replies; 22+ messages in thread
From: Huang, Ying @ 2019-11-01  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Huang Ying, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

From: Huang Ying <ying.huang@intel.com>

The write performance of PMEM is much worse than its read
performance.  So even if a write-mostly page is colder than a
read-mostly page, it is usually better to put the write-mostly page
in DRAM and the read-mostly page in PMEM.

To give write-mostly pages more opportunity to be promoted to DRAM,
this patch doubles the hot threshold for write hint page faults, which
makes such pages easier to promote.
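
A minimal sketch of the resulting check, assuming latency and threshold
are in the same time unit.  The helper name is_hot() is hypothetical;
the TNF_WRITE value and the comparison direction follow the diff in
this patch.

```c
#include <stdbool.h>

#define TNF_WRITE 0x40	/* flag value as added by this patch */

/*
 * A page counts as hot when its hint page fault latency does not
 * exceed the threshold; write faults get twice the threshold.
 */
static bool is_hot(unsigned long latency, unsigned long threshold, int flags)
{
	if (flags & TNF_WRITE)
		threshold *= 2;
	return latency <= threshold;
}
```

A page with latency 150 against a threshold of 100 is rejected on a
read fault but accepted on a write fault.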

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/sched/numa_balancing.h | 1 +
 kernel/sched/fair.c                  | 2 ++
 mm/memory.c                          | 3 +++
 3 files changed, 6 insertions(+)

diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index e1c2728d5bb2..65dc8c6e8377 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -15,6 +15,7 @@
 #define TNF_FAULT_LOCAL	0x08
 #define TNF_MIGRATE_FAIL 0x10
 #define TNF_YOUNG	0x20
+#define TNF_WRITE	0x40
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d6cf5832556e..0a83e9cf6685 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1513,6 +1513,8 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 
 		threshold = msecs_to_jiffies(
 			sysctl_numa_balancing_hot_threshold);
+		if (flags & TNF_WRITE)
+			threshold *= 2;
 		latency = numa_hint_fault_latency(p, addr);
 		if (latency > threshold)
 			return false;
diff --git a/mm/memory.c b/mm/memory.c
index 80902ff7f5de..fed6c943ba60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3729,6 +3729,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	if (pte_young(old_pte))
 		flags |= TNF_YOUNG;
 
+	if (vmf->flags & FAULT_FLAG_WRITE)
+		flags |= TNF_WRITE;
+
 	page = vm_normal_page(vma, vmf->address, pte);
 	if (!page)
 		goto unmap_out;
-- 
2.23.0



* [RFC 10/10] autonuma, memory tiering: Adjust hot threshold automatically
  2019-11-01  7:57 [RFC 00/10] autonuma: Optimize memory placement in memory tiering system Huang, Ying
                   ` (8 preceding siblings ...)
  2019-11-01  7:57 ` [RFC 09/10] autonuma, memory tiering: Double hot threshold for write hint page fault Huang, Ying
@ 2019-11-01  7:57 ` Huang, Ying
  2019-11-01  9:31   ` Peter Zijlstra
  9 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2019-11-01  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Huang Ying, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

From: Huang Ying <ying.huang@intel.com>

It isn't easy for the administrator to determine the hot threshold.
So this patch implements a method to adjust the hot threshold
automatically.  The basic idea is to control the number of candidate
promotion pages so that it matches the promotion rate limit.  A page
is a candidate promotion page, that is, we will try to promote it, if
its hint page fault latency is less than the hot threshold.

If the number of candidate promotion pages in the statistics interval
is more than 10% above the promotion rate limit, the hot threshold
will be lowered to reduce the number of candidate promotion pages.  If
it is more than 10% below the limit, the hot threshold will be raised
to increase the number of candidate promotion pages.  Otherwise the
hot threshold is left unchanged.

For the above method to work, the total number of pages to check (on
which hint page faults occur) and the hot/cold distribution need to be
stable in each statistics interval.  Because the page tables are
scanned linearly in autonuma while the hot/cold distribution isn't
uniform along the address space, the statistics interval should be
larger than the autonuma scan period.  So in this patch, the max scan
period is used as the statistics interval, and it works well in our
tests.

The sysctl knob kernel.numa_balancing_hot_threshold_ms becomes the
initial value and max value of the hot threshold.
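
One adjustment step can be sketched as a pure function.  This is an
illustration only: adjust_threshold() and its parameter names are
hypothetical, the 16 steps and the 10% dead band mirror the diff
below, and the per-node bookkeeping and locking are left out.

```c
#define ADJUST_STEPS 16

/*
 * cur:        current threshold (0 means "not initialized yet")
 * ref:        reference threshold from the sysctl knob (initial and max)
 * candidates: candidate promotion pages seen in the last interval
 * target:     pages allowed by the rate limit in one interval
 */
static unsigned long adjust_threshold(unsigned long cur, unsigned long ref,
				      unsigned long candidates,
				      unsigned long target)
{
	unsigned long unit = ref / ADJUST_STEPS;

	if (!cur)
		cur = ref;

	if (candidates > target * 11 / 10) {
		/* Too many candidates: lower, but never below one unit. */
		unsigned long lowered = cur - unit;

		return lowered > unit ? lowered : unit;
	}
	if (candidates < target * 9 / 10) {
		/* Too few candidates: raise, but never above the reference. */
		unsigned long raised = cur + unit;

		return raised < ref ? raised : ref;
	}
	return cur;	/* within the dead band: leave unchanged */
}
```

With ref = 1600 the step is 100, so the threshold moves between 100 and
1600 in at most 15 steps, one step per statistics interval.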

The patch improves the score of the pmbench memory accessing benchmark
with an 80:20 read/write ratio and normal access address distribution
by 5.5%, with 24.6% fewer NUMA page migrations, on a 2-socket Intel
server with Optane DC Persistent Memory, because it improves the
accuracy of the hot page selection.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mmzone.h |  3 +++
 kernel/sched/fair.c    | 48 ++++++++++++++++++++++++++++++++++++++----
 2 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 46382b058546..afd56541252c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -772,6 +772,9 @@ typedef struct pglist_data {
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned long autonuma_jiffies;
 	unsigned long autonuma_try_migrate;
+	unsigned long autonuma_threshold_jiffies;
+	unsigned long autonuma_threshold_try_migrate;
+	unsigned long autonuma_threshold;
 #endif
 	/* Fields commonly accessed by the page reclaim scanner */
 	struct lruvec		lruvec;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a83e9cf6685..22bdbb7afac2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1486,6 +1486,41 @@ static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
 	return true;
 }
 
+#define NUMA_MIGRATION_ADJUST_STEPS	16
+
+static void numa_migration_adjust_threshold(struct pglist_data *pgdat,
+					    unsigned long rate_limit,
+					    unsigned long ref_threshold)
+{
+	unsigned long now = jiffies, last_threshold_jiffies;
+	unsigned long unit_threshold, threshold;
+	unsigned long try_migrate, ref_try_migrate, mdiff;
+
+	last_threshold_jiffies = pgdat->autonuma_threshold_jiffies;
+	if (now > last_threshold_jiffies +
+	    msecs_to_jiffies(sysctl_numa_balancing_scan_period_max) &&
+	    cmpxchg(&pgdat->autonuma_threshold_jiffies,
+		    last_threshold_jiffies, now) == last_threshold_jiffies) {
+
+		ref_try_migrate = rate_limit *
+			sysctl_numa_balancing_scan_period_max / 1000;
+		try_migrate = node_page_state(pgdat, NUMA_TRY_MIGRATE);
+		mdiff = try_migrate - pgdat->autonuma_threshold_try_migrate;
+		unit_threshold = ref_threshold / NUMA_MIGRATION_ADJUST_STEPS;
+		threshold = pgdat->autonuma_threshold;
+		if (!threshold)
+			threshold = ref_threshold;
+		if (mdiff > ref_try_migrate * 11 / 10)
+			threshold = max(threshold - unit_threshold,
+					unit_threshold);
+		else if (mdiff < ref_try_migrate * 9 / 10)
+			threshold = min(threshold + unit_threshold,
+					ref_threshold);
+		pgdat->autonuma_threshold_try_migrate = try_migrate;
+		pgdat->autonuma_threshold = threshold;
+	}
+}
+
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 				int src_nid, int dst_cpu, unsigned long addr,
 				int flags)
@@ -1501,7 +1536,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
 	    next_promotion_node(src_nid) != -1) {
 		struct pglist_data *pgdat;
-		unsigned long rate_limit, latency, threshold;
+		unsigned long rate_limit, latency, threshold, def_threshold;
 
 		pgdat = NODE_DATA(dst_nid);
 		if (pgdat_free_space_enough(pgdat))
@@ -1511,16 +1546,21 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 		if (!(flags & TNF_YOUNG))
 			return false;
 
-		threshold = msecs_to_jiffies(
+		def_threshold = msecs_to_jiffies(
 			sysctl_numa_balancing_hot_threshold);
+		rate_limit = sysctl_numa_balancing_rate_limit <<
+			(20 - PAGE_SHIFT);
+		numa_migration_adjust_threshold(pgdat, rate_limit,
+						def_threshold);
+
+		threshold = pgdat->autonuma_threshold;
+		threshold = threshold ? : def_threshold;
 		if (flags & TNF_WRITE)
 			threshold *= 2;
 		latency = numa_hint_fault_latency(p, addr);
 		if (latency > threshold)
 			return false;
 
-		rate_limit = sysctl_numa_balancing_rate_limit <<
-			(20 - PAGE_SHIFT);
 		return numa_migration_check_rate_limit(pgdat, rate_limit,
 						       hpage_nr_pages(page));
 	}
-- 
2.23.0



* Re: [RFC 10/10] autonuma, memory tiering: Adjust hot threshold automatically
  2019-11-01  7:57 ` [RFC 10/10] autonuma, memory tiering: Adjust hot threshold automatically Huang, Ying
@ 2019-11-01  9:31   ` Peter Zijlstra
  2019-11-04  6:11     ` Huang, Ying
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-11-01  9:31 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

On Fri, Nov 01, 2019 at 03:57:27PM +0800, Huang, Ying wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a83e9cf6685..22bdbb7afac2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1486,6 +1486,41 @@ static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
>  	return true;
>  }
>  
> +#define NUMA_MIGRATION_ADJUST_STEPS	16
> +
> +static void numa_migration_adjust_threshold(struct pglist_data *pgdat,
> +					    unsigned long rate_limit,
> +					    unsigned long ref_threshold)
> +{
> +	unsigned long now = jiffies, last_threshold_jiffies;
> +	unsigned long unit_threshold, threshold;
> +	unsigned long try_migrate, ref_try_migrate, mdiff;
> +
> +	last_threshold_jiffies = pgdat->autonuma_threshold_jiffies;
> +	if (now > last_threshold_jiffies +
> +	    msecs_to_jiffies(sysctl_numa_balancing_scan_period_max) &&
> +	    cmpxchg(&pgdat->autonuma_threshold_jiffies,
> +		    last_threshold_jiffies, now) == last_threshold_jiffies) {

That is seriously unreadable gunk.

> +
> +		ref_try_migrate = rate_limit *
> +			sysctl_numa_balancing_scan_period_max / 1000;
> +		try_migrate = node_page_state(pgdat, NUMA_TRY_MIGRATE);
> +		mdiff = try_migrate - pgdat->autonuma_threshold_try_migrate;
> +		unit_threshold = ref_threshold / NUMA_MIGRATION_ADJUST_STEPS;
> +		threshold = pgdat->autonuma_threshold;
> +		if (!threshold)
> +			threshold = ref_threshold;
> +		if (mdiff > ref_try_migrate * 11 / 10)
> +			threshold = max(threshold - unit_threshold,
> +					unit_threshold);
> +		else if (mdiff < ref_try_migrate * 9 / 10)
> +			threshold = min(threshold + unit_threshold,
> +					ref_threshold);

And that is violating codingstyle.

> +		pgdat->autonuma_threshold_try_migrate = try_migrate;
> +		pgdat->autonuma_threshold = threshold;
> +	}
> +}


Maybe if you use variable names that are slightly shorter than half your
line length?


* Re: [RFC 10/10] autonuma, memory tiering: Adjust hot threshold automatically
  2019-11-01  9:31   ` Peter Zijlstra
@ 2019-11-04  6:11     ` Huang, Ying
  2019-11-04  8:49       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Huang, Ying @ 2019-11-04  6:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

Peter Zijlstra <peterz@infradead.org> writes:

> On Fri, Nov 01, 2019 at 03:57:27PM +0800, Huang, Ying wrote:
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0a83e9cf6685..22bdbb7afac2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1486,6 +1486,41 @@ static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
>>  	return true;
>>  }
>>  
>> +#define NUMA_MIGRATION_ADJUST_STEPS	16
>> +
>> +static void numa_migration_adjust_threshold(struct pglist_data *pgdat,
>> +					    unsigned long rate_limit,
>> +					    unsigned long ref_threshold)
>> +{
>> +	unsigned long now = jiffies, last_threshold_jiffies;
>> +	unsigned long unit_threshold, threshold;
>> +	unsigned long try_migrate, ref_try_migrate, mdiff;
>> +
>> +	last_threshold_jiffies = pgdat->autonuma_threshold_jiffies;
>> +	if (now > last_threshold_jiffies +
>> +	    msecs_to_jiffies(sysctl_numa_balancing_scan_period_max) &&
>> +	    cmpxchg(&pgdat->autonuma_threshold_jiffies,
>> +		    last_threshold_jiffies, now) == last_threshold_jiffies) {
>
> That is seriously unreadable gunk.

The basic idea here is to adjust the hot threshold every
sysctl_numa_balancing_scan_period_max.  Because the application's
access pattern may have spatial locality and autonuma scans the address
space linearly, the statistics of NUMA hint page fault latency are in
general not stable over an arbitrary interval (such as 1 s or 200 ms).
But the scanning of the whole address space of the application is
expected to finish within sysctl_numa_balancing_scan_period_max, so
the statistics of NUMA hint page fault latency are expected to be much
more stable for the whole application.  So we choose to adjust the hot
threshold every sysctl_numa_balancing_scan_period_max.

I will add comments for this in the next version.
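
The condition in question combines a period check with a cmpxchg so
that only one caller per interval performs the adjustment.  A userspace
sketch with C11 atomics, assuming a hypothetical helper name
claim_interval() and ignoring jiffies wrap-around (kernel code would
normally use time_after() for the comparison):

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Returns true for exactly one caller once more than `period` ticks
 * have passed since the value stored in *last; that caller then owns
 * the adjustment for the new interval.
 */
static bool claim_interval(atomic_ulong *last, unsigned long now,
			   unsigned long period)
{
	unsigned long seen = atomic_load(last);

	if (now <= seen + period)
		return false;	/* interval has not elapsed yet */

	/* Only the caller that swaps `seen` for `now` wins the race. */
	return atomic_compare_exchange_strong(last, &seen, now);
}
```

Splitting the test into a named helper like this is one way to make
the intent visible at the call site.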

>> +
>> +		ref_try_migrate = rate_limit *
>> +			sysctl_numa_balancing_scan_period_max / 1000;
>> +		try_migrate = node_page_state(pgdat, NUMA_TRY_MIGRATE);
>> +		mdiff = try_migrate - pgdat->autonuma_threshold_try_migrate;
>> +		unit_threshold = ref_threshold / NUMA_MIGRATION_ADJUST_STEPS;
>> +		threshold = pgdat->autonuma_threshold;
>> +		if (!threshold)
>> +			threshold = ref_threshold;
>> +		if (mdiff > ref_try_migrate * 11 / 10)
>> +			threshold = max(threshold - unit_threshold,
>> +					unit_threshold);
>> +		else if (mdiff < ref_try_migrate * 9 / 10)
>> +			threshold = min(threshold + unit_threshold,
>> +					ref_threshold);
>
> And that is violating codingstyle.

You mean I need to use braces here?

>> +		pgdat->autonuma_threshold_try_migrate = try_migrate;
>> +		pgdat->autonuma_threshold = threshold;
>> +	}
>> +}
>
>
> Maybe if you use variable names that are slightly shorter than half your
> line length?

Sure.  I will use shorter variable names in the next version.

Best Regards,
Huang, Ying


* Re: [RFC 10/10] autonuma, memory tiering: Adjust hot threshold automatically
  2019-11-04  6:11     ` Huang, Ying
@ 2019-11-04  8:49       ` Peter Zijlstra
  2019-11-04 10:12         ` Huang, Ying
  2019-11-21  8:38         ` Huang, Ying
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2019-11-04  8:49 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

On Mon, Nov 04, 2019 at 02:11:11PM +0800, Huang, Ying wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Fri, Nov 01, 2019 at 03:57:27PM +0800, Huang, Ying wrote:
> >
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 0a83e9cf6685..22bdbb7afac2 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -1486,6 +1486,41 @@ static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
> >>  	return true;
> >>  }
> >>  
> >> +#define NUMA_MIGRATION_ADJUST_STEPS	16
> >> +
> >> +static void numa_migration_adjust_threshold(struct pglist_data *pgdat,
> >> +					    unsigned long rate_limit,
> >> +					    unsigned long ref_threshold)
> >> +{
> >> +	unsigned long now = jiffies, last_threshold_jiffies;
> >> +	unsigned long unit_threshold, threshold;
> >> +	unsigned long try_migrate, ref_try_migrate, mdiff;
> >> +
> >> +	last_threshold_jiffies = pgdat->autonuma_threshold_jiffies;
> >> +	if (now > last_threshold_jiffies +
> >> +	    msecs_to_jiffies(sysctl_numa_balancing_scan_period_max) &&
> >> +	    cmpxchg(&pgdat->autonuma_threshold_jiffies,
> >> +		    last_threshold_jiffies, now) == last_threshold_jiffies) {
> >
> > That is seriously unreadable gunk.
> 
> The basic idea here is to adjust hot threshold every

Oh, I figured out what it does, but it's just really hard to read
because of those silly variable names.

This was just a first quick read through of the patches, and stuff like
this annoys me no end. I did start a rewrite with more sensible variable
names, but figured this might not be time for that.

I still need to think and review the whole concept in more detail, now
that I've read the patches. But I need to chase regressions first :/

FWIW, can you post a SLIT / NUMA distance table for such a system?


* Re: [RFC 10/10] autonuma, memory tiering: Adjust hot threshold automatically
  2019-11-04  8:49       ` Peter Zijlstra
@ 2019-11-04 10:12         ` Huang, Ying
  2019-11-21  8:38         ` Huang, Ying
  1 sibling, 0 replies; 22+ messages in thread
From: Huang, Ying @ 2019-11-04 10:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Nov 04, 2019 at 02:11:11PM +0800, Huang, Ying wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > On Fri, Nov 01, 2019 at 03:57:27PM +0800, Huang, Ying wrote:
>> >
>> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> index 0a83e9cf6685..22bdbb7afac2 100644
>> >> --- a/kernel/sched/fair.c
>> >> +++ b/kernel/sched/fair.c
>> >> @@ -1486,6 +1486,41 @@ static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
>> >>  	return true;
>> >>  }
>> >>  
>> >> +#define NUMA_MIGRATION_ADJUST_STEPS	16
>> >> +
>> >> +static void numa_migration_adjust_threshold(struct pglist_data *pgdat,
>> >> +					    unsigned long rate_limit,
>> >> +					    unsigned long ref_threshold)
>> >> +{
>> >> +	unsigned long now = jiffies, last_threshold_jiffies;
>> >> +	unsigned long unit_threshold, threshold;
>> >> +	unsigned long try_migrate, ref_try_migrate, mdiff;
>> >> +
>> >> +	last_threshold_jiffies = pgdat->autonuma_threshold_jiffies;
>> >> +	if (now > last_threshold_jiffies +
>> >> +	    msecs_to_jiffies(sysctl_numa_balancing_scan_period_max) &&
>> >> +	    cmpxchg(&pgdat->autonuma_threshold_jiffies,
>> >> +		    last_threshold_jiffies, now) == last_threshold_jiffies) {
>> >
>> > That is seriously unreadable gunk.
>> 
>> The basic idea here is to adjust hot threshold every
>
> Oh, I figured out what it does, but it's just really hard to read
> because of those silly variable names.
>
> This was just a first quick read through of the patches, and stuff like
> this annoys me no end. I did start a rewrite with more sensible variable
> names, but figured this might not be time for that.

Sorry about the poor naming.  That is always hard for me.

> I still need to think and review the whole concept in more detail, now
> that I've read the patches. But I need to chase regressions first :/

Thanks for your help!

> FWIW, can you post a SLIT / NUMA distance table for such a system?

Sure.  I will send it to you as an attachment in another email.

Best Regards,
Huang, Ying


* Re: [RFC 10/10] autonuma, memory tiering: Adjust hot threshold automatically
  2019-11-04  8:49       ` Peter Zijlstra
  2019-11-04 10:12         ` Huang, Ying
@ 2019-11-21  8:38         ` Huang, Ying
  1 sibling, 0 replies; 22+ messages in thread
From: Huang, Ying @ 2019-11-21  8:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, Andrew Morton, Michal Hocko,
	Rik van Riel, Mel Gorman, Ingo Molnar, Dave Hansen, Dan Williams,
	Fengguang Wu

Hi, Peter,

Peter Zijlstra <peterz@infradead.org> writes:

> I still need to think and review the whole concept in more detail, now
> that I've read the patches. But I need to chase regressions first :/

Have you found some time to review the concept?

Just want to check the status.  Please don't regard this as pushing.

Best Regards,
Huang, Ying




Thread overview: 22+ messages
2019-11-01  7:57 [RFC 00/10] autonuma: Optimize memory placement in memory tiering system Huang, Ying
2019-11-01  7:57 ` [RFC 01/10] autonuma: Fix watermark checking in migrate_balanced_pgdat() Huang, Ying
2019-11-01 11:11   ` Mel Gorman
2019-11-01  7:57 ` [RFC 02/10] autonuma: Reduce cache footprint when scanning page tables Huang, Ying
2019-11-01 11:13   ` Mel Gorman
2019-11-01  7:57 ` [RFC 03/10] autonuma: Add NUMA_BALANCING_MEMORY_TIERING mode Huang, Ying
2019-11-01  7:57 ` [RFC 04/10] autonuma, memory tiering: Rate limit NUMA migration throughput Huang, Ying
2019-11-01  7:57 ` [RFC 05/10] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM Huang, Ying
2019-11-01  7:57 ` [RFC 06/10] autonuma, memory tiering: Skip to scan fastest memory Huang, Ying
2019-11-01  7:57 ` [RFC 07/10] autonuma, memory tiering: Only promote page if accessed twice Huang, Ying
2019-11-01  7:57 ` [RFC 08/10] autonuma, memory tiering: Select hotter pages to promote to fast memory node Huang, Ying
2019-11-01  9:24   ` Peter Zijlstra
2019-11-04  2:41     ` Huang, Ying
2019-11-04  8:44       ` Peter Zijlstra
2019-11-04 10:13         ` Huang, Ying
2019-11-01  7:57 ` [RFC 09/10] autonuma, memory tiering: Double hot threshold for write hint page fault Huang, Ying
2019-11-01  7:57 ` [RFC 10/10] autonuma, memory tiering: Adjust hot threshold automatically Huang, Ying
2019-11-01  9:31   ` Peter Zijlstra
2019-11-04  6:11     ` Huang, Ying
2019-11-04  8:49       ` Peter Zijlstra
2019-11-04 10:12         ` Huang, Ying
2019-11-21  8:38         ` Huang, Ying
