* [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy
@ 2019-06-13 23:29 Yang Shi
  2019-06-13 23:29 ` [v3 PATCH 1/9] mm: define N_CPU_MEM node states Yang Shi
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: Yang Shi @ 2019-06-13 23:29 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, ziy
  Cc: yang.shi, linux-mm, linux-kernel


With Dave Hansen's patches merged into Linus's tree

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4

PMEM can be hot plugged as a NUMA node now.  But how to use PMEM as a NUMA
node effectively and efficiently is worth exploring.

There have been a couple of proposals posted on the mailing list [1] [2] [3].

I already posted two versions of this patchset for demoting/promoting memory
pages between DRAM and PMEM before this topic was discussed at LSF/MM 2019
(https://lwn.net/Articles/787418/).  I do appreciate all the great suggestions
from the community.  This updated version implements most of the outcome of
that discussion; please see the design section below for the details.


Changelog
=========
v2 --> v3:
* Introduced "migrate mode" for node reclaim.  Just do demotion when
  "migrate mode" is specified per Michal Hocko and Mel Gorman.
* Introduced "migrate target" concept for VM per Mel Gorman.  The memory nodes
  which are under DRAM in the hierarchy (i.e. lower bandwidth, higher latency,
  larger capacity and cheaper than DRAM) are considered as "migrate target"
  nodes.  When "migrate mode" is on, memory reclaim would demote pages to
  the "migrate target" nodes.
* Dropped "twice access" promotion patch per Michal Hocko.
* Changed the subject for the patchset to reflect the update.
* Rebased to 5.2-rc1.

v1 --> v2:
* Dropped the default allocation node mask.  The memory placement restriction
  could be achieved by mempolicy or cpuset.
* Dropped the new mempolicy since its semantic is not that clear yet.
* Dropped PG_Promote flag.
* Defined N_CPU_MEM nodemask for the nodes which have both CPU and memory.
* Extended page_check_references() to implement "twice access" check for
  anonymous page in NUMA balancing path.
* Reworked the memory demotion code.

v2: https://lore.kernel.org/linux-mm/1554955019-29472-1-git-send-email-yang.shi@linux.alibaba.com/
v1: https://lore.kernel.org/linux-mm/1553316275-21985-1-git-send-email-yang.shi@linux.alibaba.com/


Design
======
With the development of new memory technologies, we can have cheaper and
larger memory devices on the system, e.g. PMEM, which may have higher latency
and lower bandwidth than DRAM.  Such a device can be used as persistent
storage or as volatile memory.

It fits into the memory hierarchy as second-tier memory.  This patchset
explores an approach to utilize such memory to improve memory placement.
Basically, it tries to achieve this goal by doing memory promotion/demotion
via NUMA balancing and memory reclaim.

Introduce a new "migrate" mode for node reclaim.  When DRAM is under memory
pressure and "migrate" mode is on, pages are demoted to PMEM via the node
reclaim path.  NUMA balancing will then promote pages back to DRAM once they
are referenced again.  Memory pressure on the PMEM node pushes its inactive
pages out to disk via swap.
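
For reference, patch 5/9 adds a new bit to the node reclaim mode for this:

#define RECLAIM_MIGRATE (1<<3)	/* Migrate pages to migration target
				 * node during reclaim */

so "echo 8 > /proc/sys/vm/zone_reclaim_mode" turns the migrate mode on (see
the Documentation/sysctl/vm.txt hunk in patch 5/9).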

Introduce "primary" node and "migrate target" node concepts for VM (patch 1/9
and 2/9).  The "primary" node is the node which has both CPU and memory.  The
"migrate target" node is cpuless node and under DRAM in memory hierarchy
(i.e. PMEM may be a suitable one, which has lower bandwidth, higher latency,
larger capacity and is cheaper than DRAM).  The firmware is effectively going
to enforce "cpu-less" nodes for any memory range that has differentiated
performance from the conventional memory pool, or differentiated performance
for a specific initiator.

Defined "N_CPU_MEM" nodemask for the "primary" nodes in order to distinguish
with cpuless nodes (memory only, i.e. PMEM nodes) and memoryless nodes (some
architectures, i.e. Power, may have memoryless nodes).
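
As an illustration only (nothing like this is in the series), the new mask
lets a node be classified with the standard node_state() helper:

static const char *node_class(int nid)
{
	if (node_state(nid, N_CPU_MEM))
		return "primary";	/* CPUs + memory */
	if (node_state(nid, N_MEMORY))
		return "memory-only";	/* e.g. PMEM */
	if (node_state(nid, N_CPU))
		return "cpu-only";	/* memoryless */
	return "offline or empty";
}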

It is a little bit hard to find a suitable "migrate target" node since this
requires the firmware to expose the physical characteristics of the memory
devices.  I'm not quite sure what the best way is, or whether it is ready to
use now.  Since PMEM is the only such device available for now, retrieving
the information from SRAT sounds like the easiest way.  We may figure out a
better way in the future.

Promotion/demotion happens only between "primary" nodes and "migrate target"
nodes: there is no promotion/demotion between "migrate target" nodes, no
promotion from "primary" nodes to "migrate target" nodes, and no demotion
from "migrate target" nodes to "primary" nodes.  This guarantees there are no
cycles for memory demotion or promotion.
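
A minimal sketch of the permitted direction, for illustration only (this
helper is not part of the series):

static bool demotion_allowed(int src_nid, int dst_nid)
{
	/* demote: "primary" -> "migrate target"; never target -> target */
	return node_state(src_nid, N_CPU_MEM) &&
	       node_state(dst_nid, N_MIGRATE_TARGET);
}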

According to the discussion at LSF/MM 2019, "there should only be one node to
which pages could be migrated".  So the reclaim code just tries to demote the
pages to the closest "migrate target" node and only tries once.  Otherwise "if
all nodes in the system were on a fallback list, a page would have to move
through every possible option - each RAM-based node and each persistent-memory
node - before actually being reclaimed. It would be necessary to maintain the
history of where each page has been, and would be likely to disrupt other
workloads on the system".  This is what the v2 patchset did, and v3 keeps
doing it the same way.
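
In the reclaim path the single target is picked by the extended
find_next_best_node() (patches 3/9 and 5/9), roughly:

	nodemask_t used_mask;

	nodes_clear(used_mask);
	/* true == only consider "migrate target" nodes */
	target_nid = find_next_best_node(pgdat->node_id, &used_mask, true);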

The demotion code moves all the migration candidate pages into one single
list, then migrates them together (including THP).  This improves the
efficiency of migration according to Zi Yan's research.  If the migration
fails, the unmigrated pages are put back on the LRU.

Use the most optimistic GFP flags to allocate pages on the "migrate target"
node.
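
Concretely, patch 5/9 defines them as:

#define GFP_DEMOTE	(__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_NORETRY | \
			__GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE | \
			GFP_NOWAIT)
#define GFP_TRANSHUGE_DEMOTE	(GFP_DEMOTE | __GFP_COMP)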
 
To reduce the failure rate of demotion, check whether the "migrate target"
node is contended.  If it is contended, just swap instead of migrating.  If
migration fails due to -ENOMEM, mark the node as contended.  The contended
flag is cleared once the node gets balanced.
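
The handling in patch 7/9 boils down to:

	if (err == -ENOMEM)
		set_bit(PGDAT_CONTENDED, &NODE_DATA(target_nid)->flags);

	/* ... and in clear_pgdat_congested(), once the node is balanced: */
	clear_bit(PGDAT_CONTENDED, &pgdat->flags);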

For now "migrate" mode is not compatible with cpuset and mempolicy since it
is hard to get the process's task_struct from struct page.  The cpuset and
process's mempolicy are stored in task_struct instead of mm_struct.

Only anonymous pages are handled for the time being since NUMA balancing
can't promote unmapped page cache.  Page cache could be demoted easily, but
promotion is an open question; it might be done via mark_page_accessed().

Added vmstat counters for pgdemote_kswapd, pgdemote_direct and
numa_pages_promoted.

There are definitely still a lot of details that need to be sorted out.  Any
comments are welcome.


Test
====
The stress test was done with mmtests plus application workloads (e.g.
sysbench, grep, etc).

Memory pressure was generated by running mmtests' usemem-stress-numa-compact,
then other applications were run as workloads to stress the promotion and
demotion paths.  The machine was still alive after the stress test had been
running for ~30 hours.  /proc/vmstat also shows:

...
pgdemote_kswapd 3316563
pgdemote_direct 1930721
...
numa_pages_promoted 81838


[1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/
[2]: https://lore.kernel.org/linux-mm/20190321200157.29678-1-keith.busch@intel.com/
[3]: https://lore.kernel.org/linux-mm/20190404071312.GD12864@dhcp22.suse.cz/T/#me1c1ed102741ba945c57071de9749e16a76e9f3d


Yang Shi (9):
      mm: define N_CPU_MEM node states
      mm: Introduce migrate target nodemask
      mm: page_alloc: make find_next_best_node return migration target node
      mm: migrate: make migrate_pages() return nr_succeeded
      mm: vmscan: demote anon DRAM pages to migration target node
      mm: vmscan: don't demote for memcg reclaim
      mm: vmscan: check if the demote target node is contended or not
      mm: vmscan: add page demotion counter
      mm: numa: add page promotion counter

 Documentation/sysctl/vm.txt    |   6 +++
 drivers/acpi/numa.c            |  12 +++++
 drivers/base/node.c            |   4 ++
 include/linux/gfp.h            |  12 +++++
 include/linux/migrate.h        |   6 ++-
 include/linux/mmzone.h         |   3 ++
 include/linux/nodemask.h       |   4 +-
 include/linux/vm_event_item.h  |   3 ++
 include/linux/vmstat.h         |   1 +
 include/trace/events/migrate.h |   3 +-
 mm/compaction.c                |   3 +-
 mm/debug.c                     |   1 +
 mm/gup.c                       |   4 +-
 mm/huge_memory.c               |   4 ++
 mm/internal.h                  |  23 ++++++++
 mm/memory-failure.c            |   7 ++-
 mm/memory.c                    |   4 ++
 mm/memory_hotplug.c            |  10 +++-
 mm/mempolicy.c                 |   7 ++-
 mm/migrate.c                   |  33 ++++++++----
 mm/page_alloc.c                |  20 +++++--
 mm/vmscan.c                    | 186 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 mm/vmstat.c                    |  14 ++++-
 23 files changed, 323 insertions(+), 47 deletions(-)



* [v3 PATCH 1/9] mm: define N_CPU_MEM node states
  2019-06-13 23:29 [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
@ 2019-06-13 23:29 ` Yang Shi
  2019-06-13 23:29 ` [v3 PATCH 2/9] mm: Introduce migrate target nodemask Yang Shi
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Yang Shi @ 2019-06-13 23:29 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, ziy
  Cc: yang.shi, linux-mm, linux-kernel

The kernel has some pre-defined node masks called node states, e.g.
N_MEMORY, N_CPU, etc.  But there might be cpuless nodes, e.g. PMEM
nodes, and some architectures, e.g. Power, may have memoryless nodes.
It is not very straightforward to get the nodes with both CPUs and
memory.  So, define an N_CPU_MEM node state.  The nodes with both CPUs
and memory are called "primary" nodes.  /sys/devices/system/node/primary
shows the currently online "primary" nodes.
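
For illustration only (not part of this patch), the file can be read from
userspace like the other node state masks; a hypothetical snippet:

	#include <stdio.h>

	int main(void)
	{
		char buf[256];
		FILE *f = fopen("/sys/devices/system/node/primary", "r");

		if (f && fgets(buf, sizeof(buf), f))
			printf("primary nodes: %s", buf);
		if (f)
			fclose(f);
		return 0;
	}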

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 drivers/base/node.c      |  2 ++
 include/linux/nodemask.h |  3 ++-
 mm/memory_hotplug.c      |  6 ++++++
 mm/page_alloc.c          |  1 +
 mm/vmstat.c              | 11 +++++++++--
 5 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 8598fcb..4d80fc8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -984,6 +984,7 @@ static ssize_t show_node_state(struct device *dev,
 #endif
 	[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
 	[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
+	[N_CPU_MEM] = _NODE_ATTR(primary, N_CPU_MEM),
 };
 
 static struct attribute *node_state_attrs[] = {
@@ -995,6 +996,7 @@ static ssize_t show_node_state(struct device *dev,
 #endif
 	&node_state_attr[N_MEMORY].attr.attr,
 	&node_state_attr[N_CPU].attr.attr,
+	&node_state_attr[N_CPU_MEM].attr.attr,
 	NULL
 };
 
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 27e7fa3..66a8964 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -398,7 +398,8 @@ enum node_states {
 	N_HIGH_MEMORY = N_NORMAL_MEMORY,
 #endif
 	N_MEMORY,		/* The node has memory(regular, high, movable) */
-	N_CPU,		/* The node has one or more cpus */
+	N_CPU,			/* The node has one or more cpus */
+	N_CPU_MEM,		/* The node has both cpus and memory */
 	NR_NODE_STATES
 };
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 328878b..7c29282 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -709,6 +709,9 @@ static void node_states_set_node(int node, struct memory_notify *arg)
 
 	if (arg->status_change_nid >= 0)
 		node_set_state(node, N_MEMORY);
+
+	if (node_state(node, N_CPU))
+		node_set_state(node, N_CPU_MEM);
 }
 
 static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn,
@@ -1526,6 +1529,9 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
 
 	if (arg->status_change_nid >= 0)
 		node_clear_state(node, N_MEMORY);
+
+	if (node_state(node, N_CPU))
+		node_clear_state(node, N_CPU_MEM);
 }
 
 static int __ref __offline_pages(unsigned long start_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b13d39..757db89e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -124,6 +124,7 @@ struct pcpu_drain {
 #endif
 	[N_MEMORY] = { { [0] = 1UL } },
 	[N_CPU] = { { [0] = 1UL } },
+	[N_CPU_MEM] = { { [0] = 1UL } },
 #endif	/* NUMA */
 };
 EXPORT_SYMBOL(node_states);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a7d4933..d876ac0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1905,15 +1905,22 @@ static void __init init_cpu_node_state(void)
 	int node;
 
 	for_each_online_node(node) {
-		if (cpumask_weight(cpumask_of_node(node)) > 0)
+		if (cpumask_weight(cpumask_of_node(node)) > 0) {
 			node_set_state(node, N_CPU);
+			if (node_state(node, N_MEMORY))
+				node_set_state(node, N_CPU_MEM);
+		}
 	}
 }
 
 static int vmstat_cpu_online(unsigned int cpu)
 {
+	int node = cpu_to_node(cpu);
+
 	refresh_zone_stat_thresholds();
-	node_set_state(cpu_to_node(cpu), N_CPU);
+	node_set_state(node, N_CPU);
+	if (node_state(node, N_MEMORY))
+		node_set_state(node, N_CPU_MEM);
 	return 0;
 }
 
-- 
1.8.3.1



* [v3 PATCH 2/9] mm: Introduce migrate target nodemask
  2019-06-13 23:29 [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
  2019-06-13 23:29 ` [v3 PATCH 1/9] mm: define N_CPU_MEM node states Yang Shi
@ 2019-06-13 23:29 ` Yang Shi
  2019-06-13 23:29 ` [v3 PATCH 3/9] mm: page_alloc: make find_next_best_node find return migration target node Yang Shi
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Yang Shi @ 2019-06-13 23:29 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, ziy
  Cc: yang.shi, linux-mm, linux-kernel

As more memory types are invented, the system may have a heterogeneous
memory hierarchy, e.g. DRAM and PMEM.  Some of these memories are cheaper
and slower than DRAM and may be good candidates to be used as secondary
memory to store data that is not recently or frequently used.

Introduce the "migrate target" nodemask for such memory nodes.  A migrate
target could be any memory type which is cheaper and/or slower than DRAM.
Currently PMEM is one such memory.
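
For illustration only (not part of this patch), kernel code could walk the
tagged nodes with the usual node state iterator:

	static void dump_migrate_targets(void)
	{
		int nid;

		for_each_node_state(nid, N_MIGRATE_TARGET)
			pr_info("node %d is a migration target\n", nid);
	}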

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 drivers/acpi/numa.c      | 12 ++++++++++++
 drivers/base/node.c      |  2 ++
 include/linux/nodemask.h |  1 +
 mm/page_alloc.c          |  1 +
 4 files changed, 16 insertions(+)

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 3099583..f75adba 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -296,6 +296,18 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
 		goto out_err_bad_srat;
 	}
 
+	/*
+	 * The system may have memory hierarchy, some memory may be good
+	 * candidate for migration target, i.e. PMEM is one of them.  Mark
+	 * such memory as migration target.
+	 *
+	 * It may be better to retrieve such information from HMAT, but
+	 * SRAT sounds good enough for now.  May switch to HMAT in the
+	 * future.
+	 */ 
+	if (ma->flags & ACPI_SRAT_MEM_NON_VOLATILE)
+		node_set_state(node, N_MIGRATE_TARGET);
+
 	node_set(node, numa_nodes_parsed);
 
 	pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s%s\n",
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 4d80fc8..351b694 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -985,6 +985,7 @@ static ssize_t show_node_state(struct device *dev,
 	[N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY),
 	[N_CPU] = _NODE_ATTR(has_cpu, N_CPU),
 	[N_CPU_MEM] = _NODE_ATTR(primary, N_CPU_MEM),
+	[N_MIGRATE_TARGET] = _NODE_ATTR(migrate_target, N_MIGRATE_TARGET),
 };
 
 static struct attribute *node_state_attrs[] = {
@@ -997,6 +998,7 @@ static ssize_t show_node_state(struct device *dev,
 	&node_state_attr[N_MEMORY].attr.attr,
 	&node_state_attr[N_CPU].attr.attr,
 	&node_state_attr[N_CPU_MEM].attr.attr,
+	&node_state_attr[N_MIGRATE_TARGET].attr.attr,
 	NULL
 };
 
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 66a8964..411618c 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -400,6 +400,7 @@ enum node_states {
 	N_MEMORY,		/* The node has memory(regular, high, movable) */
 	N_CPU,			/* The node has one or more cpus */
 	N_CPU_MEM,		/* The node has both cpus and memory */
+	N_MIGRATE_TARGET,	/* The node is suitable migrate target */
 	NR_NODE_STATES
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 757db89e..3b37c71 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -125,6 +125,7 @@ struct pcpu_drain {
 	[N_MEMORY] = { { [0] = 1UL } },
 	[N_CPU] = { { [0] = 1UL } },
 	[N_CPU_MEM] = { { [0] = 1UL } },
+	[N_MIGRATE_TARGET] = { { [0] = 1UL } },
 #endif	/* NUMA */
 };
 EXPORT_SYMBOL(node_states);
-- 
1.8.3.1



* [v3 PATCH 3/9] mm: page_alloc: make find_next_best_node return migration target node
  2019-06-13 23:29 [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
  2019-06-13 23:29 ` [v3 PATCH 1/9] mm: define N_CPU_MEM node states Yang Shi
  2019-06-13 23:29 ` [v3 PATCH 2/9] mm: Introduce migrate target nodemask Yang Shi
@ 2019-06-13 23:29 ` Yang Shi
  2019-06-13 23:29 ` [v3 PATCH 4/9] mm: migrate: make migrate_pages() return nr_succeeded Yang Shi
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Yang Shi @ 2019-06-13 23:29 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, ziy
  Cc: yang.shi, linux-mm, linux-kernel

We need to find the closest migration target node to demote DRAM pages to.
Add a "migration" parameter to find_next_best_node() to skip DRAM nodes on
demand.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/internal.h   | 11 +++++++++++
 mm/page_alloc.c | 14 ++++++++++----
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 9eeaf2b..a3181e2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -292,6 +292,17 @@ static inline bool is_data_mapping(vm_flags_t flags)
 	return (flags & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE;
 }
 
+#ifdef CONFIG_NUMA
+extern int find_next_best_node(int node, nodemask_t *used_node_mask,
+			       bool migration);
+#else
+static inline int find_next_best_node(int node, nodemask_t *used_node_mask,
+				      bool migration)
+{
+	return 0;
+}
+#endif
+
 /* mm/util.c */
 void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct vm_area_struct *prev, struct rb_node *rb_parent);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b37c71..917f64d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5425,6 +5425,7 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
  * find_next_best_node - find the next node that should appear in a given node's fallback list
  * @node: node whose fallback list we're appending
  * @used_node_mask: nodemask_t of already used nodes
+ * @migration: find next best migration target node
  *
  * We use a number of factors to determine which is the next node that should
  * appear on a given node's fallback list.  The node should not have appeared
@@ -5436,7 +5437,8 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
  *
  * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
  */
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+int find_next_best_node(int node, nodemask_t *used_node_mask,
+			bool migration)
 {
 	int n, val;
 	int min_val = INT_MAX;
@@ -5444,13 +5446,18 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 	const struct cpumask *tmp = cpumask_of_node(0);
 
 	/* Use the local node if we haven't already */
-	if (!node_isset(node, *used_node_mask)) {
+	if (!node_isset(node, *used_node_mask) &&
+	    !migration) {
 		node_set(node, *used_node_mask);
 		return node;
 	}
 
 	for_each_node_state(n, N_MEMORY) {
 
+		/* Find next best migration target node */
+		if (migration && !node_state(n, N_MIGRATE_TARGET))
+			continue;
+
 		/* Don't want a node to appear more than once */
 		if (node_isset(n, *used_node_mask))
 			continue;
@@ -5482,7 +5489,6 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 	return best_node;
 }
 
-
 /*
  * Build zonelists ordered by node and zones within node.
  * This results in maximum locality--normal zone overflows into local
@@ -5544,7 +5550,7 @@ static void build_zonelists(pg_data_t *pgdat)
 	nodes_clear(used_mask);
 
 	memset(node_order, 0, sizeof(node_order));
-	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+	while ((node = find_next_best_node(local_node, &used_mask, false)) >= 0) {
 		/*
 		 * We don't want to pressure a particular node.
 		 * So adding penalty to the first node in same
-- 
1.8.3.1



* [v3 PATCH 4/9] mm: migrate: make migrate_pages() return nr_succeeded
  2019-06-13 23:29 [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
                   ` (2 preceding siblings ...)
  2019-06-13 23:29 ` [v3 PATCH 3/9] mm: page_alloc: make find_next_best_node return migration target node Yang Shi
@ 2019-06-13 23:29 ` Yang Shi
  2019-06-13 23:29 ` [v3 PATCH 5/9] mm: vmscan: demote anon DRAM pages to migration target node Yang Shi
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Yang Shi @ 2019-06-13 23:29 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, ziy
  Cc: yang.shi, linux-mm, linux-kernel

migrate_pages() returns the number of pages that were not migrated, or an
error code.  When an error code is returned, there is no way to know how
many pages were or were not migrated.

In a following patch, migrate_pages() is used to demote pages to a PMEM
node, and we need to account for how many pages are reclaimed (demoted)
since page reclaim behavior depends on this.  Add a *nr_succeeded parameter
to make migrate_pages() return how many pages were migrated successfully in
all cases.
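
Callers that do not care about the count just pass a throwaway variable; for
example, the do_move_pages_to_node() call site below becomes:

	unsigned int nr_succeeded = 0;

	err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
			    MIGRATE_SYNC, MR_SYSCALL, &nr_succeeded);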

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/migrate.h |  5 +++--
 mm/compaction.c         |  3 ++-
 mm/gup.c                |  4 +++-
 mm/memory-failure.c     |  7 +++++--
 mm/memory_hotplug.c     |  4 +++-
 mm/mempolicy.c          |  7 +++++--
 mm/migrate.c            | 18 ++++++++++--------
 mm/page_alloc.c         |  4 +++-
 8 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e13d9bf..837fdd1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -66,7 +66,8 @@ extern int migrate_page(struct address_space *mapping,
 			struct page *newpage, struct page *page,
 			enum migrate_mode mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
-		unsigned long private, enum migrate_mode mode, int reason);
+		unsigned long private, enum migrate_mode mode, int reason,
+		unsigned int *nr_succeeded);
 extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 extern void putback_movable_page(struct page *page);
 
@@ -84,7 +85,7 @@ extern int migrate_page_move_mapping(struct address_space *mapping,
 static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t new,
 		free_page_t free, unsigned long private, enum migrate_mode mode,
-		int reason)
+		int reason, unsigned int *nr_succeeded)
 	{ return -ENOSYS; }
 static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
 	{ return -EBUSY; }
diff --git a/mm/compaction.c b/mm/compaction.c
index 9febc8c..c1723e5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2074,6 +2074,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 	unsigned long last_migrated_pfn;
 	const bool sync = cc->mode != MIGRATE_ASYNC;
 	bool update_cached;
+	unsigned int nr_succeeded = 0;
 
 	cc->migratetype = gfpflags_to_migratetype(cc->gfp_mask);
 	ret = compaction_suitable(cc->zone, cc->order, cc->alloc_flags,
@@ -2182,7 +2183,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				compaction_free, (unsigned long)cc, cc->mode,
-				MR_COMPACTION);
+				MR_COMPACTION, &nr_succeeded);
 
 		trace_mm_compaction_migratepages(cc->nr_migratepages, err,
 							&cc->migratepages);
diff --git a/mm/gup.c b/mm/gup.c
index 2c08248..446ce25 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1337,6 +1337,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
 	long i;
 	bool drain_allow = true;
 	bool migrate_allow = true;
+	unsigned int nr_succeeded = 0;
 	LIST_HEAD(cma_page_list);
 
 check_again:
@@ -1377,7 +1378,8 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
 			put_page(pages[i]);
 
 		if (migrate_pages(&cma_page_list, new_non_cma_page,
-				  NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) {
+				  NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE,
+				  &nr_succeeded)) {
 			/*
 			 * some of the pages failed migration. Do get_user_pages
 			 * without migration.
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fc8b517..b5d8a8f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1686,6 +1686,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
 	struct page *hpage = compound_head(page);
+	unsigned int nr_succeeded = 0;
 	LIST_HEAD(pagelist);
 
 	/*
@@ -1713,7 +1714,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	}
 
 	ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
-				MIGRATE_SYNC, MR_MEMORY_FAILURE);
+				MIGRATE_SYNC, MR_MEMORY_FAILURE, &nr_succeeded);
 	if (ret) {
 		pr_info("soft offline: %#lx: hugepage migration failed %d, type %lx (%pGp)\n",
 			pfn, ret, page->flags, &page->flags);
@@ -1742,6 +1743,7 @@ static int __soft_offline_page(struct page *page, int flags)
 {
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
+	unsigned int nr_succeeded = 0;
 
 	/*
 	 * Check PageHWPoison again inside page lock because PageHWPoison
@@ -1801,7 +1803,8 @@ static int __soft_offline_page(struct page *page, int flags)
 						page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
-					MIGRATE_SYNC, MR_MEMORY_FAILURE);
+					MIGRATE_SYNC, MR_MEMORY_FAILURE,
+					&nr_succeeded);
 		if (ret) {
 			if (!list_empty(&pagelist))
 				putback_movable_pages(&pagelist);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7c29282..1192d08 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1360,6 +1360,7 @@ static struct page *new_node_page(struct page *page, unsigned long private)
 	unsigned long pfn;
 	struct page *page;
 	int ret = 0;
+	unsigned int nr_succeeded = 0;
 	LIST_HEAD(source);
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
@@ -1416,7 +1417,8 @@ static struct page *new_node_page(struct page *page, unsigned long private)
 	if (!list_empty(&source)) {
 		/* Allocate a new page from the nearest neighbor node */
 		ret = migrate_pages(&source, new_node_page, NULL, 0,
-					MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
+					MIGRATE_SYNC, MR_MEMORY_HOTPLUG,
+					&nr_succeeded);
 		if (ret) {
 			list_for_each_entry(page, &source, lru) {
 				pr_warn("migrating pfn %lx failed ret:%d ",
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2219e74..b7bc60b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -988,6 +988,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 	nodemask_t nmask;
 	LIST_HEAD(pagelist);
 	int err = 0;
+	unsigned int nr_succeeded = 0;
 
 	nodes_clear(nmask);
 	node_set(source, nmask);
@@ -1003,7 +1004,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest,
-					MIGRATE_SYNC, MR_SYSCALL);
+					MIGRATE_SYNC, MR_SYSCALL, &nr_succeeded);
 		if (err)
 			putback_movable_pages(&pagelist);
 	}
@@ -1182,6 +1183,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	struct mempolicy *new;
 	unsigned long end;
 	int err;
+	unsigned int nr_succeeded = 0;
 	LIST_HEAD(pagelist);
 
 	if (flags & ~(unsigned long)MPOL_MF_VALID)
@@ -1254,7 +1256,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 		if (!list_empty(&pagelist)) {
 			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
 			nr_failed = migrate_pages(&pagelist, new_page, NULL,
-				start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
+				start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND,
+				&nr_succeeded);
 			if (nr_failed)
 				putback_movable_pages(&pagelist);
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index f2ecc28..bc4242a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1392,6 +1392,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
  * @mode:		The migration mode that specifies the constraints for
  *			page migration, if any.
  * @reason:		The reason for page migration.
+ * @nr_succeeded:	The number of pages migrated successfully.
  *
  * The function returns after 10 attempts or if no pages are movable any more
  * because the list has become empty or no retryable pages exist any more.
@@ -1402,11 +1403,10 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
  */
 int migrate_pages(struct list_head *from, new_page_t get_new_page,
 		free_page_t put_new_page, unsigned long private,
-		enum migrate_mode mode, int reason)
+		enum migrate_mode mode, int reason, unsigned int *nr_succeeded)
 {
 	int retry = 1;
 	int nr_failed = 0;
-	int nr_succeeded = 0;
 	int pass = 0;
 	struct page *page;
 	struct page *page2;
@@ -1460,7 +1460,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 				retry++;
 				break;
 			case MIGRATEPAGE_SUCCESS:
-				nr_succeeded++;
+				(*nr_succeeded)++;
 				break;
 			default:
 				/*
@@ -1477,11 +1477,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 	nr_failed += retry;
 	rc = nr_failed;
 out:
-	if (nr_succeeded)
-		count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
+	if (*nr_succeeded)
+		count_vm_events(PGMIGRATE_SUCCESS, *nr_succeeded);
 	if (nr_failed)
 		count_vm_events(PGMIGRATE_FAIL, nr_failed);
-	trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
+	trace_mm_migrate_pages(*nr_succeeded, nr_failed, mode, reason);
 
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
@@ -1506,12 +1506,13 @@ static int do_move_pages_to_node(struct mm_struct *mm,
 		struct list_head *pagelist, int node)
 {
 	int err;
+	unsigned int nr_succeeded = 0;
 
 	if (list_empty(pagelist))
 		return 0;
 
 	err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
-			MIGRATE_SYNC, MR_SYSCALL);
+			MIGRATE_SYNC, MR_SYSCALL, &nr_succeeded);
 	if (err)
 		putback_movable_pages(pagelist);
 	return err;
@@ -1944,6 +1945,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated;
 	int nr_remaining;
+	unsigned int nr_succeeded = 0;
 	LIST_HEAD(migratepages);
 
 	/*
@@ -1968,7 +1970,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	list_add(&page->lru, &migratepages);
 	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
 				     NULL, node, MIGRATE_ASYNC,
-				     MR_NUMA_MISPLACED);
+				     MR_NUMA_MISPLACED, &nr_succeeded);
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 917f64d..7e95a66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8209,6 +8209,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 	unsigned long pfn = start;
 	unsigned int tries = 0;
 	int ret = 0;
+	unsigned int nr_succeeded = 0;
 
 	migrate_prep();
 
@@ -8236,7 +8237,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 		cc->nr_migratepages -= nr_reclaimed;
 
 		ret = migrate_pages(&cc->migratepages, alloc_migrate_target,
-				    NULL, 0, cc->mode, MR_CONTIG_RANGE);
+				    NULL, 0, cc->mode, MR_CONTIG_RANGE,
+				    &nr_succeeded);
 	}
 	if (ret < 0) {
 		putback_movable_pages(&cc->migratepages);
-- 
1.8.3.1



* [v3 PATCH 5/9] mm: vmscan: demote anon DRAM pages to migration target node
  2019-06-13 23:29 [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
                   ` (3 preceding siblings ...)
  2019-06-13 23:29 ` [v3 PATCH 4/9] mm: migrate: make migrate_pages() return nr_succeeded Yang Shi
@ 2019-06-13 23:29 ` Yang Shi
  2019-06-13 23:29 ` [v3 PATCH 6/9] mm: vmscan: don't demote for memcg reclaim Yang Shi
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Yang Shi @ 2019-06-13 23:29 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, ziy
  Cc: yang.shi, linux-mm, linux-kernel

Since a migration target node (e.g. PMEM) typically provides larger
capacity than DRAM and has much lower access latency than disk, it is a
good choice to use as a middle tier between DRAM and disk in the page
reclaim path.

With migration target nodes, the demotion path of anonymous pages could be:

DRAM -> PMEM -> swap device

This patch demotes anonymous pages only for the time being and demotes
THP to the migration target node as a whole.  To avoid expensive page
reclaim and/or compaction on the target node if there is memory pressure
on it, the most conservative gfp flags are used, which fail quickly if
there is memory pressure and just wake up kswapd on failure.
migrate_pages() splits a THP and migrates its base pages one by one upon
THP allocation failure.

Demote pages to the closest migration target node even if the system is
swapless.  The current page reclaim logic only scans the anon LRU when
swap is on and swappiness is set properly.  Demoting to the migration
target doesn't need to care whether swap is available or not.  But
reclaiming from the migration target node still skips the anon LRU if
swap is not available.

The demotion only happens from a DRAM node to its closest migration target
node.  Demoting to a remote migration target node or migrating from the
target node to DRAM on the reclaim path is not allowed.

Also define a new migration reason for demotion, called MR_DEMOTE.
Pages are demoted via async migration to avoid blocking.

The migration is only allowed via node reclaim.  Introduce a new node
reclaim mode: migrate mode.  The migrate mode is not compatible with
cpuset and mempolicy settings.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 Documentation/sysctl/vm.txt    |   6 ++
 include/linux/gfp.h            |  12 ++++
 include/linux/migrate.h        |   1 +
 include/trace/events/migrate.h |   3 +-
 mm/debug.c                     |   1 +
 mm/internal.h                  |  12 ++++
 mm/migrate.c                   |  15 +++-
 mm/vmscan.c                    | 157 +++++++++++++++++++++++++++++++++--------
 8 files changed, 175 insertions(+), 32 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 7493220..4b76a55 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -919,6 +919,7 @@ This is value ORed together of
 1	= Zone reclaim on
 2	= Zone reclaim writes dirty pages out
 4	= Zone reclaim swaps pages
+8	= Zone reclaim migrate pages
 
 zone_reclaim_mode is disabled by default.  For file servers or workloads
 that benefit from having their data cached, zone_reclaim_mode should be
@@ -943,4 +944,9 @@ Allowing regular swap effectively restricts allocations to the local
 node unless explicitly overridden by memory policies or cpuset
 configurations.
 
+Allowing zone reclaim to migrate pages lets reclaim move them to migration
+target nodes (e.g. NVDIMM nodes), which are typically cheaper and slower than
+DRAM but have larger capacity, if such nodes are present in the system.  The
+migrate mode is not compatible with cpuset and mempolicy settings.
+
 ============ End of Document =================================
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fb07b50..b294455 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -285,6 +285,14 @@
  * available and will not wake kswapd/kcompactd on failure. The _LIGHT
  * version does not attempt reclaim/compaction at all and is by default used
  * in page fault path, while the non-light is used by khugepaged.
+ *
+ * %GFP_DEMOTE is for migration on memory reclaim (a.k.a demotion) allocations.
+ * The allocation might happen in kswapd or direct reclaim, so it looks
+ * safer to assume __GFP_IO and __GFP_FS are not allowed.  Demotion happens
+ * only for user pages (on the LRU) and on a specific node.  It will fail
+ * quickly if memory is not available, but may wake up kswapd on failure.
+ *
+ * %GFP_TRANSHUGE_DEMOTE is used for THP demotion allocation.
  */
 #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
@@ -300,6 +308,10 @@
 #define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
 #define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
+#define GFP_DEMOTE	(__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_NORETRY | \
+			__GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE | \
+			GFP_NOWAIT)
+#define GFP_TRANSHUGE_DEMOTE	(GFP_DEMOTE | __GFP_COMP)
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 837fdd1..cfb1f57 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -25,6 +25,7 @@ enum migrate_reason {
 	MR_MEMPOLICY_MBIND,
 	MR_NUMA_MISPLACED,
 	MR_CONTIG_RANGE,
+	MR_DEMOTE,
 	MR_TYPES
 };
 
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 705b33d..c1d5b36 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@
 	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
 	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
 	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
-	EMe(MR_CONTIG_RANGE,	"contig_range")
+	EM( MR_CONTIG_RANGE,	"contig_range")			\
+	EMe(MR_DEMOTE,		"demote")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff --git a/mm/debug.c b/mm/debug.c
index 8345bb6..0bcced8 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -25,6 +25,7 @@
 	"mempolicy_mbind",
 	"numa_misplaced",
 	"cma",
+	"demote",
 };
 
 const struct trace_print_flags pageflag_names[] = {
diff --git a/mm/internal.h b/mm/internal.h
index a3181e2..3d756f2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -303,6 +303,18 @@ static inline int find_next_best_node(int node, nodemask_t *used_node_mask,
 }
 #endif
 
+static inline bool has_migration_target_node_online(void)
+{
+	int nid;
+
+	for_each_online_node(nid) {
+		if (node_state(nid, N_MIGRATE_TARGET))
+			return true;
+	}
+
+	return false;
+}
+
 /* mm/util.c */
 void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct vm_area_struct *prev, struct rb_node *rb_parent);
diff --git a/mm/migrate.c b/mm/migrate.c
index bc4242a..9fb76a6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1006,7 +1006,8 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 }
 
 static int __unmap_and_move(struct page *page, struct page *newpage,
-				int force, enum migrate_mode mode)
+				int force, enum migrate_mode mode,
+				enum migrate_reason reason)
 {
 	int rc = -EAGAIN;
 	int page_was_mapped = 0;
@@ -1143,8 +1144,16 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	if (rc == MIGRATEPAGE_SUCCESS) {
 		if (unlikely(!is_lru))
 			put_page(newpage);
-		else
+		else {
+			/*
+			 * Put demoted pages on the target node's
+			 * active LRU.
+			 */
+			if (!PageUnevictable(newpage) &&
+			    reason == MR_DEMOTE)
+				SetPageActive(newpage);
 			putback_lru_page(newpage);
+		}
 	}
 
 	return rc;
@@ -1198,7 +1207,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		goto out;
 	}
 
-	rc = __unmap_and_move(page, newpage, force, mode);
+	rc = __unmap_and_move(page, newpage, force, mode, reason);
 	if (rc == MIGRATEPAGE_SUCCESS)
 		set_page_owner_migrate_reason(newpage, reason);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7acd0af..428a83b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1094,6 +1094,55 @@ static void page_check_dirty_writeback(struct page *page,
 		mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
 }
 
+#ifdef CONFIG_NUMA
+#define RECLAIM_OFF 0
+#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
+#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE (1<<3)	/* Migrate pages to migration target
+				 * node during reclaim */
+static struct page *alloc_demote_page(struct page *page, unsigned long node)
+{
+	if (unlikely(PageHuge(page)))
+		/* HugeTLB demotion is not supported for now */
+		BUG();
+	else if (PageTransHuge(page)) {
+		struct page *thp;
+
+		thp = alloc_pages_node(node, GFP_TRANSHUGE_DEMOTE,
+				       HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	} else
+		return __alloc_pages_node(node, GFP_DEMOTE, 0);
+}
+#else
+static inline struct page *alloc_demote_page(struct page *page,
+					     unsigned long node)
+{
+	return NULL;
+}
+#endif
+
+static inline bool is_demote_ok(int nid)
+{
+	/* Just do demotion with migrate mode of node reclaim */
+	if (!(node_reclaim_mode & RECLAIM_MIGRATE))
+		return false;
+
+	/* Current node is cpuless node */
+	if (!node_state(nid, N_CPU_MEM))
+		return false;
+
+	/* No online migration target node */
+	if (!has_migration_target_node_online())
+		return false;
+
+	return true;
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -1106,6 +1155,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(demote_pages);
 	unsigned nr_reclaimed = 0;
 	unsigned pgactivate = 0;
 
@@ -1269,6 +1319,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		if (PageAnon(page) && PageSwapBacked(page)) {
 			if (!PageSwapCache(page)) {
+				/*
+				 * Demote anonymous pages only for now and
+				 * skip MADV_FREE pages.
+				 *
+				 * Demotion only happens from primary nodes
+				 * to cpuless nodes.
+				 */
+				if (is_demote_ok(page_to_nid(page))) {
+					list_add(&page->lru, &demote_pages);
+					unlock_page(page);
+					continue;
+				}
 				if (!(sc->gfp_mask & __GFP_IO))
 					goto keep_locked;
 				if (PageTransHuge(page)) {
@@ -1480,6 +1542,30 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
 
+	/* Demote pages to migration target */
+	if (!list_empty(&demote_pages)) {
+		int err, target_nid;
+		unsigned int nr_succeeded = 0;
+		nodemask_t used_mask;
+
+		nodes_clear(used_mask);
+		target_nid = find_next_best_node(pgdat->node_id, &used_mask,
+						 true);
+
+		/* Demotion would ignore all cpuset and mempolicy settings */
+		err = migrate_pages(&demote_pages, alloc_demote_page, NULL,
+				    target_nid, MIGRATE_ASYNC, MR_DEMOTE,
+				    &nr_succeeded);
+
+		nr_reclaimed += nr_succeeded;
+
+		if (err) {
+			putback_movable_pages(&demote_pages);
+
+			list_splice(&ret_pages, &demote_pages);
+		}
+	}
+
 	mem_cgroup_uncharge_list(&free_pages);
 	try_to_unmap_flush();
 	free_unref_page_list(&free_pages);
@@ -2136,10 +2222,11 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	unsigned long gb;
 
 	/*
-	 * If we don't have swap space, anonymous page deactivation
-	 * is pointless.
+	 * If we don't have swap space or migration target node online,
+	 * anonymous page deactivation is pointless.
 	 */
-	if (!file && !total_swap_pages)
+	if (!file && !total_swap_pages &&
+	    !is_demote_ok(pgdat->node_id))
 		return false;
 
 	inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
@@ -2213,22 +2300,34 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	unsigned long ap, fp;
 	enum lru_list lru;
 
-	/* If we have no swap space, do not bother scanning anon pages. */
-	if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
-		scan_balance = SCAN_FILE;
-		goto out;
-	}
-
 	/*
-	 * Global reclaim will swap to prevent OOM even with no
-	 * swappiness, but memcg users want to use this knob to
-	 * disable swapping for individual groups completely when
-	 * using the memory controller's swap limit feature would be
-	 * too expensive.
+	 * Anon pages can be demoted to PMEM. If there is PMEM node online,
+	 * still scan anonymous LRU even though the system is swapless or
+	 * swapping is disabled by memcg.
+	 *
+	 * If current node is already PMEM node, demotion is not applicable.
 	 */
-	if (!global_reclaim(sc) && !swappiness) {
-		scan_balance = SCAN_FILE;
-		goto out;
+	if (!is_demote_ok(pgdat->node_id)) {
+		/*
+		 * If we have no swap space, do not bother scanning
+		 * anon pages.
+		 */
+		if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
+			scan_balance = SCAN_FILE;
+			goto out;
+		}
+
+		/*
+		 * Global reclaim will swap to prevent OOM even with no
+		 * swappiness, but memcg users want to use this knob to
+		 * disable swapping for individual groups completely when
+		 * using the memory controller's swap limit feature would be
+		 * too expensive.
+		 */
+		if (!global_reclaim(sc) && !swappiness) {
+			scan_balance = SCAN_FILE;
+			goto out;
+		}
 	}
 
 	/*
@@ -2577,7 +2676,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	 */
 	pages_for_compaction = compact_gap(sc->order);
 	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
+	if (get_nr_swap_pages() > 0 || is_demote_ok(pgdat->node_id))
 		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
@@ -3262,7 +3361,8 @@ static void age_active_anon(struct pglist_data *pgdat,
 {
 	struct mem_cgroup *memcg;
 
-	if (!total_swap_pages)
+	/* Aging anon page as long as demotion is fine */
+	if (!total_swap_pages && !is_demote_ok(pgdat->node_id))
 		return;
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
@@ -4003,11 +4103,6 @@ static int __init kswapd_init(void)
  */
 int node_reclaim_mode __read_mostly;
 
-#define RECLAIM_OFF 0
-#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
-#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
-
 /*
  * Priority for NODE_RECLAIM. This determines the fraction of pages
  * of a node considered for each zone_reclaim. 4 scans 1/16th of
@@ -4084,8 +4179,10 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.gfp_mask = current_gfp_context(gfp_mask),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
-		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
+		.may_writepage = !!((node_reclaim_mode & RECLAIM_WRITE) ||
+				    (node_reclaim_mode & RECLAIM_MIGRATE)),
+		.may_unmap = !!((node_reclaim_mode & RECLAIM_UNMAP) ||
+				(node_reclaim_mode & RECLAIM_MIGRATE)),
 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
 	};
@@ -4105,7 +4202,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
+	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
+	    (node_reclaim_mode & RECLAIM_MIGRATE)) {
 		/*
 		 * Free memory by calling shrink node with increasing
 		 * priorities until we have enough memory freed.
@@ -4138,9 +4236,12 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	 * thrown out if the node is overallocated. So we do not reclaim
 	 * if less than a specified percentage of the node is used by
 	 * unmapped file backed pages.
+	 *
+	 * Migrate mode doesn't care about the above restrictions.
 	 */
 	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
-	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages &&
+	    !(node_reclaim_mode & RECLAIM_MIGRATE))
 		return NODE_RECLAIM_FULL;
 
 	/*
-- 
1.8.3.1



* [v3 PATCH 6/9] mm: vmscan: don't demote for memcg reclaim
  2019-06-13 23:29 [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
                   ` (4 preceding siblings ...)
  2019-06-13 23:29 ` [v3 PATCH 5/9] mm: vmscan: demote anon DRAM pages to migration target node Yang Shi
@ 2019-06-13 23:29 ` Yang Shi
  2019-06-13 23:29 ` [v3 PATCH 7/9] mm: vmscan: check if the demote target node is contended or not Yang Shi
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Yang Shi @ 2019-06-13 23:29 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, ziy
  Cc: yang.shi, linux-mm, linux-kernel

Memcg reclaim happens when the limit is breached, but demotion just
migrates pages to another node instead of reclaiming them.  This is
pointless for memcg reclaim since the usage is not reduced at all.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/vmscan.c | 38 +++++++++++++++++++++-----------------
 1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 428a83b..fb931ded 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1126,12 +1126,16 @@ static inline struct page *alloc_demote_page(struct page *page,
 }
 #endif
 
-static inline bool is_demote_ok(int nid)
+static inline bool is_demote_ok(int nid, struct scan_control *sc)
 {
 	/* Just do demotion with migrate mode of node reclaim */
 	if (!(node_reclaim_mode & RECLAIM_MIGRATE))
 		return false;
 
+	/* It is pointless to do demotion in memcg reclaim */
+	if (!global_reclaim(sc))
+		return false;
+
 	/* Current node is cpuless node */
 	if (!node_state(nid, N_CPU_MEM))
 		return false;
@@ -1326,7 +1330,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				 * Demotion only happen from primary nodes
 				 * to cpuless nodes.
 				 */
-				if (is_demote_ok(page_to_nid(page))) {
+				if (is_demote_ok(page_to_nid(page), sc)) {
 					list_add(&page->lru, &demote_pages);
 					unlock_page(page);
 					continue;
@@ -2226,7 +2230,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	 * anonymous page deactivation is pointless.
 	 */
 	if (!file && !total_swap_pages &&
-	    !is_demote_ok(pgdat->node_id))
+	    !is_demote_ok(pgdat->node_id, sc))
 		return false;
 
 	inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
@@ -2307,7 +2311,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 *
 	 * If current node is already PMEM node, demotion is not applicable.
 	 */
-	if (!is_demote_ok(pgdat->node_id)) {
+	if (!is_demote_ok(pgdat->node_id, sc)) {
 		/*
 		 * If we have no swap space, do not bother scanning
 		 * anon pages.
@@ -2316,18 +2320,18 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 			scan_balance = SCAN_FILE;
 			goto out;
 		}
+	}
 
-		/*
-		 * Global reclaim will swap to prevent OOM even with no
-		 * swappiness, but memcg users want to use this knob to
-		 * disable swapping for individual groups completely when
-		 * using the memory controller's swap limit feature would be
-		 * too expensive.
-		 */
-		if (!global_reclaim(sc) && !swappiness) {
-			scan_balance = SCAN_FILE;
-			goto out;
-		}
+	/*
+	 * Global reclaim will swap to prevent OOM even with no
+	 * swappiness, but memcg users want to use this knob to
+	 * disable swapping for individual groups completely when
+	 * using the memory controller's swap limit feature would be
+	 * too expensive.
+	 */
+	if (!global_reclaim(sc) && !swappiness) {
+		scan_balance = SCAN_FILE;
+		goto out;
 	}
 
 	/*
@@ -2676,7 +2680,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	 */
 	pages_for_compaction = compact_gap(sc->order);
 	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-	if (get_nr_swap_pages() > 0 || is_demote_ok(pgdat->node_id))
+	if (get_nr_swap_pages() > 0 || is_demote_ok(pgdat->node_id, sc))
 		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
@@ -3362,7 +3366,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 	struct mem_cgroup *memcg;
 
 	/* Aging anon page as long as demotion is fine */
-	if (!total_swap_pages && !is_demote_ok(pgdat->node_id))
+	if (!total_swap_pages && !is_demote_ok(pgdat->node_id, sc))
 		return;
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
-- 
1.8.3.1



* [v3 PATCH 7/9] mm: vmscan: check if the demote target node is contended or not
  2019-06-13 23:29 [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
                   ` (5 preceding siblings ...)
  2019-06-13 23:29 ` [v3 PATCH 6/9] mm: vmscan: don't demote for memcg reclaim Yang Shi
@ 2019-06-13 23:29 ` Yang Shi
  2019-06-13 23:29 ` [v3 PATCH 8/9] mm: vmscan: add page demotion counter Yang Shi
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Yang Shi @ 2019-06-13 23:29 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, ziy
  Cc: yang.shi, linux-mm, linux-kernel

When demoting to the migration target node, the target node may be under
memory pressure, and that memory pressure may cause migrate_pages() to
fail.

If the failure is caused by memory pressure (i.e. -ENOMEM is returned),
tag the node with PGDAT_CONTENDED.  The tag is cleared once the target
node is balanced again.

Before demoting, check whether the target node is PGDAT_CONTENDED; if it
is, just skip demotion.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/mmzone.h |  3 +++
 mm/vmscan.c            | 37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 70394ca..d4e05c5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -573,6 +573,9 @@ enum pgdat_flags {
 					 * many pages under writeback
 					 */
 	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
+	PGDAT_CONTENDED,		/* the node has not enough free memory
+					 * available
+					 */
 };
 
 enum zone_flags {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb931ded..9ec55d7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1126,6 +1126,21 @@ static inline struct page *alloc_demote_page(struct page *page,
 }
 #endif
 
+static inline bool is_migration_target_contended(int nid)
+{
+	int node;
+	nodemask_t used_mask;
+
+
+	nodes_clear(used_mask);
+	node = find_next_best_node(nid, &used_mask, true);
+
+	if (test_bit(PGDAT_CONTENDED, &NODE_DATA(node)->flags))
+		return true;
+
+	return false;
+}
+
 static inline bool is_demote_ok(int nid, struct scan_control *sc)
 {
 	/* Just do demotion with migrate mode of node reclaim */
@@ -1144,6 +1159,10 @@ static inline bool is_demote_ok(int nid, struct scan_control *sc)
 	if (!has_migration_target_node_online())
 		return false;
 
+	/* Check if the demote target node is contended or not */
+	if (is_migration_target_contended(nid))
+		return false;
+
 	return true;
 }
 
@@ -1564,6 +1583,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		nr_reclaimed += nr_succeeded;
 
 		if (err) {
+			if (err == -ENOMEM)
+				set_bit(PGDAT_CONTENDED,
+					&NODE_DATA(target_nid)->flags);
+
 			putback_movable_pages(&demote_pages);
 
 			list_splice(&ret_pages, &demote_pages);
@@ -2597,6 +2620,19 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 		 * scan target and the percentage scanning already complete
 		 */
 		lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
+
+		/*
+		 * shrink_page_list() may find that the demote target node is
+		 * contended; if so it doesn't make sense to scan the
+		 * anonymous LRU again.
+		 *
+		 * Also check whether swap is available, since demotion may
+		 * happen on a swapless system.
+		 */
+		if (!is_demote_ok(pgdat->node_id, sc) &&
+		    (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0))
+			lru = LRU_FILE;
+
 		nr_scanned = targets[lru] - nr[lru];
 		nr[lru] = targets[lru] * (100 - percentage) / 100;
 		nr[lru] -= min(nr[lru], nr_scanned);
@@ -3447,6 +3483,7 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
 	clear_bit(PGDAT_CONGESTED, &pgdat->flags);
 	clear_bit(PGDAT_DIRTY, &pgdat->flags);
 	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
+	clear_bit(PGDAT_CONTENDED, &pgdat->flags);
 }
 
 /*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [v3 PATCH 8/9] mm: vmscan: add page demotion counter
  2019-06-13 23:29 [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
                   ` (6 preceding siblings ...)
  2019-06-13 23:29 ` [v3 PATCH 7/9] mm: vmscan: check if the demote target node is contended or not Yang Shi
@ 2019-06-13 23:29 ` Yang Shi
  2019-06-13 23:29 ` [v3 PATCH 9/9] mm: numa: add page promotion counter Yang Shi
  2019-06-27  2:57 ` [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
  9 siblings, 0 replies; 11+ messages in thread
From: Yang Shi @ 2019-06-13 23:29 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, ziy
  Cc: yang.shi, linux-mm, linux-kernel

Account the number of demoted pages in reclaim_stat->nr_demoted.

Add pgdemote_kswapd and pgdemote_direct VM counters, shown in
/proc/vmstat.
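
Once the counters land, demotion activity can be watched from userspace by
scraping /proc/vmstat.  A minimal sketch (only the two counter names come from
this patch; the rest is ordinary userspace code):

	/* vmstat_demote.c - print the demotion counters added by this patch.
	 * Build: gcc -o vmstat_demote vmstat_demote.c
	 */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *fp = fopen("/proc/vmstat", "r");

		if (!fp) {
			perror("/proc/vmstat");
			return 1;
		}

		while (fgets(line, sizeof(line), fp)) {
			/* Lines look like "pgdemote_kswapd 3316563" */
			if (!strncmp(line, "pgdemote_kswapd ", 16) ||
			    !strncmp(line, "pgdemote_direct ", 16))
				fputs(line, stdout);
		}

		fclose(fp);
		return 0;
	}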

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/vm_event_item.h | 2 ++
 include/linux/vmstat.h        | 1 +
 mm/vmscan.c                   | 8 ++++++++
 mm/vmstat.c                   | 2 ++
 4 files changed, 13 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441..499a3aa 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -32,6 +32,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGREFILL,
 		PGSTEAL_KSWAPD,
 		PGSTEAL_DIRECT,
+		PGDEMOTE_KSWAPD,
+		PGDEMOTE_DIRECT,
 		PGSCAN_KSWAPD,
 		PGSCAN_DIRECT,
 		PGSCAN_DIRECT_THROTTLE,
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index bdeda4b..00d53d4 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -29,6 +29,7 @@ struct reclaim_stat {
 	unsigned nr_activate[2];
 	unsigned nr_ref_keep;
 	unsigned nr_unmap_fail;
+	unsigned nr_demoted;
 };
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9ec55d7..f65cd45 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -130,6 +130,7 @@ struct scan_control {
 		unsigned int immediate;
 		unsigned int file_taken;
 		unsigned int taken;
+		unsigned int demoted;
 	} nr;
 };
 
@@ -1582,6 +1583,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 		nr_reclaimed += nr_succeeded;
 
+		stat->nr_demoted = nr_succeeded;
+		if (current_is_kswapd())
+			__count_vm_events(PGDEMOTE_KSWAPD, stat->nr_demoted);
+		else
+			__count_vm_events(PGDEMOTE_DIRECT, stat->nr_demoted);
+
 		if (err) {
 			if (err == -ENOMEM)
 				set_bit(PGDAT_CONTENDED,
@@ -2097,6 +2104,7 @@ static int current_may_throttle(void)
 	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
 	sc->nr.writeback += stat.nr_writeback;
 	sc->nr.immediate += stat.nr_immediate;
+	sc->nr.demoted += stat.nr_demoted;
 	sc->nr.taken += nr_taken;
 	if (file)
 		sc->nr.file_taken += nr_taken;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d876ac0..eee29a9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1192,6 +1192,8 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 	"pgrefill",
 	"pgsteal_kswapd",
 	"pgsteal_direct",
+	"pgdemote_kswapd",
+	"pgdemote_direct",
 	"pgscan_kswapd",
 	"pgscan_direct",
 	"pgscan_direct_throttle",
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [v3 PATCH 9/9] mm: numa: add page promotion counter
  2019-06-13 23:29 [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
                   ` (7 preceding siblings ...)
  2019-06-13 23:29 ` [v3 PATCH 8/9] mm: vmscan: add page demotion counter Yang Shi
@ 2019-06-13 23:29 ` Yang Shi
  2019-06-27  2:57 ` [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
  9 siblings, 0 replies; 11+ messages in thread
From: Yang Shi @ 2019-06-13 23:29 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, ziy
  Cc: yang.shi, linux-mm, linux-kernel

Add a counter for pages promoted by NUMA balancing.
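
As with the demotion counters in the previous patch, numa_pages_promoted can be
read straight from /proc/vmstat (the small reader sketched there only needs the
extra counter name).  Note that, per the hunks below, the event is only counted
when a page moves from a node outside N_CPU_MEM onto an N_CPU_MEM node, so NUMA
balancing migrations between two primary nodes still only show up in
numa_pages_migrated.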

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/vm_event_item.h | 1 +
 mm/huge_memory.c              | 4 ++++
 mm/memory.c                   | 4 ++++
 mm/vmstat.c                   | 1 +
 4 files changed, 10 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 499a3aa..9f52a62 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -51,6 +51,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NUMA_HINT_FAULTS,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
+		NUMA_PAGE_PROMOTE,
 #endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9f8bce9..01cfe29 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1638,6 +1638,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
 				vmf->pmd, pmd, vmf->address, page, target_nid);
 	if (migrated) {
+		if (!node_state(page_nid, N_CPU_MEM) &&
+		    node_state(target_nid, N_CPU_MEM))
+			count_vm_numa_events(NUMA_PAGE_PROMOTE, HPAGE_PMD_NR);
+
 		flags |= TNF_MIGRATED;
 		page_nid = target_nid;
 	} else
diff --git a/mm/memory.c b/mm/memory.c
index 96f1d47..e554cd5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3770,6 +3770,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	/* Migrate to the requested node */
 	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated) {
+		if (!node_state(page_nid, N_CPU_MEM) &&
+		    node_state(target_nid, N_CPU_MEM))
+			count_vm_numa_event(NUMA_PAGE_PROMOTE);
+
 		page_nid = target_nid;
 		flags |= TNF_MIGRATED;
 	} else
diff --git a/mm/vmstat.c b/mm/vmstat.c
index eee29a9..0140736 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1220,6 +1220,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 	"numa_hint_faults",
 	"numa_hint_faults_local",
 	"numa_pages_migrated",
+	"numa_pages_promoted",
 #endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy
  2019-06-13 23:29 [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
                   ` (8 preceding siblings ...)
  2019-06-13 23:29 ` [v3 PATCH 9/9] mm: numa: add page promotion counter Yang Shi
@ 2019-06-27  2:57 ` Yang Shi
  9 siblings, 0 replies; 11+ messages in thread
From: Yang Shi @ 2019-06-27  2:57 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, ziy
  Cc: linux-mm, linux-kernel

Hi folks,


Any comment on this version?


Thanks,

Yang



On 6/13/19 4:29 PM, Yang Shi wrote:
> With Dave Hansen's patches merged into Linus's tree
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>
> PMEM could be hot plugged as NUMA node now.  But, how to use PMEM as NUMA
> node effectively and efficiently is worth exploring.
>
> There have been a couple of proposals posted on the mailing list [1] [2] [3].
>
> I already posted two versions of patchset for demoting/promoting memory pages
> between DRAM and PMEM before this topic was discussed at LSF/MM 2019
> (https://lwn.net/Articles/787418/).  I do appreciate all the great suggestions
> from the community.  This updated version implemented the most discussion,
> please see the below design section for the details.
>
>
> Changelog
> =========
> v2 --> v3:
> * Introduced "migrate mode" for node reclaim.  Just do demotion when
>    "migrate mode" is specified per Michal Hocko and Mel Gorman.
> * Introduced "migrate target" concept for VM per Mel Gorman.  The memory nodes
>    which are under DRAM in the hierarchy (i.e. lower bandwidth, higher latency,
>    larger capacity and cheaper than DRAM) are considered as "migrate target"
>    nodes.  When "migrate mode" is on, memory reclaim would demote pages to
>    the "migrate target" nodes.
> * Dropped "twice access" promotion patch per Michal Hocko.
> * Changed the subject for the patchset to reflect the update.
> * Rebased to 5.2-rc1.
>
> v1 --> v2:
> * Dropped the default allocation node mask.  The memory placement restriction
>    could be achieved by mempolicy or cpuset.
> * Dropped the new mempolicy since its semantic is not that clear yet.
> * Dropped PG_Promote flag.
> * Defined N_CPU_MEM nodemask for the nodes which have both CPU and memory.
> * Extended page_check_references() to implement "twice access" check for
>    anonymous page in NUMA balancing path.
> * Reworked the memory demotion code.
>
> v2: https://lore.kernel.org/linux-mm/1554955019-29472-1-git-send-email-yang.shi@linux.alibaba.com/
> v1: https://lore.kernel.org/linux-mm/1553316275-21985-1-git-send-email-yang.shi@linux.alibaba.com/
>
>
> Design
> ======
> With the development of new memory technology, we could have cheaper and
> larger memory device on the system, which may have higher latency and lower
> bandwidth than DRAM, i.e. PMEM.  It could be used as persistent storage or
> volatile memory.
>
> It fits into the memory hierarchy as a second tier memory.  The patchset
> tries to explore an approach to utilize such memory to improve the memory
> placement.  Basically, the patchset tries to achieve this goal by doing
> memory promotion/demotion via NUMA balancing and memory reclaim.
>
> Introduce a new "migrate" mode for node reclaim.  When DRAM has memory
> pressure, demote pages to PMEM via node reclaim path if "migrate" mode is
> on.  Then NUMA balancing will promote pages to DRAM as long as the page is
> referenced again.  The memory pressure on PMEM node would push the inactive
> pages of PMEM to disk via swap.
>
> Introduce "primary" node and "migrate target" node concepts for VM (patch 1/9
> and 2/9).  A "primary" node is a node which has both CPU and memory.  A
> "migrate target" node is a cpuless node that sits below DRAM in the memory
> hierarchy (i.e. PMEM may be a suitable one, with lower bandwidth, higher
> latency, larger capacity and lower cost than DRAM).  The firmware is effectively going
> to enforce "cpu-less" nodes for any memory range that has differentiated
> performance from the conventional memory pool, or differentiated performance
> for a specific initiator.
>
> Defined an "N_CPU_MEM" nodemask for the "primary" nodes in order to distinguish
> them from cpuless nodes (memory only, e.g. PMEM nodes) and memoryless nodes
> (some architectures, e.g. Power, may have memoryless nodes).
>
> It is a little bit hard to find a suitable "migrate target" node since this
> requires the firmware to expose the physical characteristics of the memory
> devices.  I'm not quite sure what the best way would be, or whether it is
> ready to use now.  Since PMEM is the only such device available for now,
> retrieving the information from SRAT sounds like the easiest way.  We may
> figure out a better way in the future.
>
> The promotion/demotion happens only between "primary" nodes and "migrate target"
> nodes.  There is no promotion/demotion between "migrate target" nodes, no
> promotion from "primary" nodes to "migrate target" nodes, and no demotion from
> "migrate target" nodes to "primary" nodes.  This guarantees there are no cycles
> in memory demotion or promotion.
>
> According to the discussion at LSF/MM 2019, "there should only be one node to
> which pages could be migrated".   So reclaim code just tries to demote the pages
> to the closest "migrate target" node and only tries once.  Otherwise "if all
> nodes in the system were on a fallback list, a page would have to move through
> every possible option - each RAM-based node and each persistent-memory node -
> before actually being reclaimed. It would be necessary to maintain the history
> of where each page has been, and would be likely to disrupt other workloads on
> the system".  This is what v2 patchset does, so keep doing it in the same way
> in v3.
>
> The demotion code moves all the migration candidate pages onto a single list,
> then migrates them together (including THP).  This improves the efficiency
> of migration according to Zi Yan's research.  If the migration fails, the
> unmigrated pages are put back on the LRU.
>
> Use the most optimistic GFP flags to allocate pages on the "migrate target"
> node.
>   
> To reduce the failure rate of demotion, check whether the "migrate target" node
> is contended.  If it is contended, just swap instead of migrating.  If migration
> fails due to -ENOMEM, mark the node as contended.  The contended flag will be
> cleared once the node gets balanced.
>
> For now "migrate" mode is not compatible with cpuset and mempolicy since it
> is hard to get the process's task_struct from struct page.  The cpuset and
> process's mempolicy are stored in task_struct instead of mm_struct.
>
> Anonymous pages only for the time being, since NUMA balancing can't promote
> unmapped page cache.  Page cache can be demoted easily, but promotion is still
> an open question; it may be doable via mark_page_accessed().
>
> Added vmstat counters for pgdemote_kswapd, pgdemote_direct and
> numa_pages_promoted.
>
> There are definitely still a lot of details that need to be sorted out.  Any
> comment is welcome.
>
>
> Test
> ====
> The stress test was done with mmtests + application workloads (e.g. sysbench,
> grep, etc).
>
> Generate memory pressure by running mmtests' usemem-stress-numa-compact,
> then run other applications as workloads to stress the promotion and demotion
> path.  The machine was still alive after the stress test had been running for
> ~30 hours.  The /proc/vmstat also shows:
>
> ...
> pgdemote_kswapd 3316563
> pgdemote_direct 1930721
> ...
> numa_pages_promoted 81838
>
>
> [1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/
> [2]: https://lore.kernel.org/linux-mm/20190321200157.29678-1-keith.busch@intel.com/
> [3]: https://lore.kernel.org/linux-mm/20190404071312.GD12864@dhcp22.suse.cz/T/#me1c1ed102741ba945c57071de9749e16a76e9f3d
>
>
> Yang Shi (9):
>        mm: define N_CPU_MEM node states
>        mm: Introduce migrate target nodemask
>        mm: page_alloc: make find_next_best_node find return migration target node
>        mm: migrate: make migrate_pages() return nr_succeeded
>        mm: vmscan: demote anon DRAM pages to migration target node
>        mm: vmscan: don't demote for memcg reclaim
>        mm: vmscan: check if the demote target node is contended or not
>        mm: vmscan: add page demotion counter
>        mm: numa: add page promotion counter
>
>   Documentation/sysctl/vm.txt    |   6 +++
>   drivers/acpi/numa.c            |  12 +++++
>   drivers/base/node.c            |   4 ++
>   include/linux/gfp.h            |  12 +++++
>   include/linux/migrate.h        |   6 ++-
>   include/linux/mmzone.h         |   3 ++
>   include/linux/nodemask.h       |   4 +-
>   include/linux/vm_event_item.h  |   3 ++
>   include/linux/vmstat.h         |   1 +
>   include/trace/events/migrate.h |   3 +-
>   mm/compaction.c                |   3 +-
>   mm/debug.c                     |   1 +
>   mm/gup.c                       |   4 +-
>   mm/huge_memory.c               |   4 ++
>   mm/internal.h                  |  23 ++++++++
>   mm/memory-failure.c            |   7 ++-
>   mm/memory.c                    |   4 ++
>   mm/memory_hotplug.c            |  10 +++-
>   mm/mempolicy.c                 |   7 ++-
>   mm/migrate.c                   |  33 ++++++++----
>   mm/page_alloc.c                |  20 +++++--
>   mm/vmscan.c                    | 186 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
>   mm/vmstat.c                    |  14 ++++-
>   23 files changed, 323 insertions(+), 47 deletions(-)


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-06-27  2:57 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-13 23:29 [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
2019-06-13 23:29 ` [v3 PATCH 1/9] mm: define N_CPU_MEM node states Yang Shi
2019-06-13 23:29 ` [v3 PATCH 2/9] mm: Introduce migrate target nodemask Yang Shi
2019-06-13 23:29 ` [v3 PATCH 3/9] mm: page_alloc: make find_next_best_node find return migration target node Yang Shi
2019-06-13 23:29 ` [v3 PATCH 4/9] mm: migrate: make migrate_pages() return nr_succeeded Yang Shi
2019-06-13 23:29 ` [v3 PATCH 5/9] mm: vmscan: demote anon DRAM pages to migration target node Yang Shi
2019-06-13 23:29 ` [v3 PATCH 6/9] mm: vmscan: don't demote for memcg reclaim Yang Shi
2019-06-13 23:29 ` [v3 PATCH 7/9] mm: vmscan: check if the demote target node is contended or not Yang Shi
2019-06-13 23:29 ` [v3 PATCH 8/9] mm: vmscan: add page demotion counter Yang Shi
2019-06-13 23:29 ` [v3 PATCH 9/9] mm: numa: add page promotion counter Yang Shi
2019-06-27  2:57 ` [v3 RFC PATCH 0/9] Migrate mode for node reclaim with heterogeneous memory hierarchy Yang Shi
