* [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
@ 2019-03-23  4:44 Yang Shi
  2019-03-23  4:44 ` [PATCH 01/10] mm: control memory placement by nodemask for two tier main memory Yang Shi
                   ` (11 more replies)
  0 siblings, 12 replies; 66+ messages in thread
From: Yang Shi @ 2019-03-23  4:44 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang
  Cc: yang.shi, linux-mm, linux-kernel


With Dave Hansen's patches merged into Linus's tree

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4

PMEM can now be hot-plugged as a NUMA node. But how to use PMEM as a NUMA node
effectively and efficiently is still an open question.

There have been a couple of proposals posted on the mailing list [1] [2].

This patchset tries a different approach from proposal [1] to using PMEM as
NUMA nodes.

The approach is designed to follow the principles below:

1. Use PMEM as a normal NUMA node: no special gfp flags, zones, zonelists, etc.

2. DRAM first/by default. No surprises for existing applications and default
runs. PMEM will not be allocated from unless its node is specified explicitly
by NUMA policy (see the usage sketch after this list). Some applications may
not be very sensitive to memory latency, so they could be placed on PMEM nodes
and then have their hot pages promoted to DRAM gradually.

3. Compatible with current NUMA policy semantics.

4. Don't assume hardware topology. But the patchset still assumes a two-tier
heterogeneous memory system. I understand that generalizing to multi-tier
heterogeneous memory has been discussed before, and I agree that is preferable
eventually. However, the kernel doesn't have such a capability yet. When HMAT
is fully ready we could certainly extract the NUMA topology from it.

5. Control memory allocation and hot/cold page promotion/demotion on a per-VMA
basis.
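
To make principle 2 concrete, here is a minimal userspace sketch (not part of
the series; link with -lnuma) of explicitly opting a single mapping into a
PMEM node with the existing MPOL_BIND API.  The choice of node 2 as the PMEM
node is only an assumption about the example machine's topology:

#include <numaif.h>		/* mbind(), MPOL_BIND; link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	size_t len = 1UL << 21;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Assumption: node 2 is a PMEM node on this machine. */
	unsigned long pmem_mask = 1UL << 2;

	/*
	 * Without this explicit policy the pages would fault in from DRAM
	 * nodes only (def_alloc_nodemask); with it the mapping is allowed
	 * to be backed by the PMEM node.
	 */
	if (mbind(buf, len, MPOL_BIND, &pmem_mask, sizeof(pmem_mask) * 8, 0))
		perror("mbind");

	return 0;
}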

To achieve the above principles, the design can be summarized by the
following points:

1. Keep per-node global fallback zonelists (including both DRAM and PMEM), and
use def_alloc_nodemask to exclude non-DRAM nodes from default allocation unless
they are specified by mempolicy. Currently the kernel can only distinguish
volatile from non-volatile memory, so just build the nodemask from the SRAT
flag. In the future it may be better to build the nodemask from more exposed
hardware information, e.g. HMAT attributes, so that it can be extended to
multi-tier memory systems easily.

2. Introduce a new mempolicy, MPOL_HYBRID, to keep the semantics of the other
mempolicies intact. We would like memory placement control at per-process or
even per-VMA granularity, so mempolicy seems more reasonable than madvise. The
new mempolicy is mainly used for launching processes on PMEM nodes and then
migrating hot pages to DRAM nodes via NUMA balancing (see the usage sketch
after this list). MPOL_BIND could bind to PMEM nodes too, but migrating to
DRAM nodes would break its semantics, and MPOL_PREFERRED can't constrain the
allocation to PMEM nodes. So a new mempolicy seems needed to fulfill the use
case.

3. The new mempolicy promotes pages to DRAM via NUMA balancing. IMHO, the
kernel is not a good place to implement a sophisticated hot/cold page
detection algorithm due to the complexity and overhead, but it should still
have such a capability. NUMA balancing seems like a good starting point.

4. Promote a page only once it has been faulted twice. Use PG_promote to
track the first fault, and promote the page to DRAM on the second one. This
is an optimization to NUMA balancing to reduce migration thrashing and the
overhead of migrating from PMEM.

5. When DRAM is under memory pressure, demote pages to PMEM via the page
reclaim path. This is quite similar to the other proposals. NUMA balancing
will then promote a page back to DRAM once it is referenced again. But the
promotion/demotion still assumes two-tier main memory, and the demotion may
break mempolicy.

6. Anonymous pages only for the time being, since NUMA balancing can't promote
unmapped page cache.
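
As a usage illustration for design point 2, here is a minimal sketch (not part
of the series; link with -lnuma) of launching a latency-insensitive workload
under MPOL_HYBRID.  The MPOL_HYBRID value follows the uapi enum extension in
patch 02, and the PMEM node numbers are assumptions about the example machine:

#include <numaif.h>		/* set_mempolicy(); link with -lnuma */
#include <stdio.h>

/* Not in numaif.h yet; the value follows the enum added in patch 02. */
#ifndef MPOL_HYBRID
#define MPOL_HYBRID	5
#endif

int main(void)
{
	/* Assumption: nodes 2 and 3 are the PMEM nodes on this machine. */
	unsigned long nodes = (1UL << 2) | (1UL << 3);

	/*
	 * New allocations land on the PMEM nodes; NUMA balancing may later
	 * promote hot pages to DRAM, which plain MPOL_BIND would not do.
	 */
	if (set_mempolicy(MPOL_HYBRID, &nodes, sizeof(nodes) * 8))
		perror("set_mempolicy(MPOL_HYBRID)");

	/* ... run or exec the workload here ... */
	return 0;
}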

The patchset still misses some pieces and is premature, but I would like to
post it to LKML to gather feedback and comments, and to have more eyes on it
to make sure I'm on the right track.

Any comments are welcome.


TODO:

1. Promote page cache. There are a couple of ways to handle this in the
kernel, e.g. promote via the active LRU in the reclaim path on the PMEM node,
or promote in mark_page_accessed() (see the sketch after this list).

2. Promote/demote HugeTLB pages. HugeTLB is currently not on the LRU and NUMA
balancing just skips it.

3. Possibly place kernel pages (e.g. page tables, slabs, etc.) on DRAM only.

4. Support the new mempolicy in userspace tools, e.g. numactl.
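
For TODO item 1, below is a purely hypothetical sketch (not part of this
series) of the kind of filter a mark_page_accessed() based approach might
use.  The helper name and placement are assumptions, and the actual promotion
would still need a migrate_pages() based helper that does not exist yet:

/* Hypothetical: would be called from mark_page_accessed() in mm/swap.c. */
static bool pagecache_promote_candidate(struct page *page)
{
	/* Only unmapped page cache sitting on a non-DRAM (PMEM) node. */
	if (node_isset(page_to_nid(page), def_alloc_nodemask))
		return false;
	if (PageAnon(page) || page_mapped(page))
		return false;

	/* Reuse the "accessed twice" filter from patch 04. */
	if (!PagePromote(page)) {
		SetPagePromote(page);
		return false;
	}

	return true;
}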


[1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/
[2]: https://lore.kernel.org/linux-mm/20190321200157.29678-1-keith.busch@intel.com/


Yang Shi (10):
      mm: control memory placement by nodemask for two tier main memory
      mm: mempolicy: introduce MPOL_HYBRID policy
      mm: mempolicy: promote page to DRAM for MPOL_HYBRID
      mm: numa: promote pages to DRAM when it is accessed twice
      mm: page_alloc: make find_next_best_node could skip DRAM node
      mm: vmscan: demote anon DRAM pages to PMEM node
      mm: vmscan: add page demotion counter
      mm: numa: add page promotion counter
      doc: add description for MPOL_HYBRID mode
      doc: elaborate the PMEM allocation rule

 Documentation/admin-guide/mm/numa_memory_policy.rst |  10 ++++
 Documentation/vm/numa.rst                           |   7 ++-
 arch/x86/mm/numa.c                                  |   1 +
 drivers/acpi/numa.c                                 |   8 +++
 include/linux/migrate.h                             |   1 +
 include/linux/mmzone.h                              |   3 ++
 include/linux/page-flags.h                          |   4 ++
 include/linux/vm_event_item.h                       |   3 ++
 include/linux/vmstat.h                              |   1 +
 include/trace/events/migrate.h                      |   3 +-
 include/trace/events/mmflags.h                      |   3 +-
 include/uapi/linux/mempolicy.h                      |   1 +
 mm/debug.c                                          |   1 +
 mm/huge_memory.c                                    |  14 ++++++
 mm/internal.h                                       |  33 ++++++++++++
 mm/memory.c                                         |  12 +++++
 mm/mempolicy.c                                      |  74 ++++++++++++++++++++++++---
 mm/page_alloc.c                                     |  33 +++++++++---
 mm/vmscan.c                                         | 113 +++++++++++++++++++++++++++++++++++-------
 mm/vmstat.c                                         |   3 ++
 20 files changed, 295 insertions(+), 33 deletions(-)

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 01/10] mm: control memory placement by nodemask for two tier main memory
  2019-03-23  4:44 [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Yang Shi
@ 2019-03-23  4:44 ` Yang Shi
  2019-03-23 17:21     ` Dan Williams
  2019-03-23  4:44 ` [PATCH 02/10] mm: mempolicy: introduce MPOL_HYBRID policy Yang Shi
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 66+ messages in thread
From: Yang Shi @ 2019-03-23  4:44 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang
  Cc: yang.shi, linux-mm, linux-kernel

When running applications on a machine with NVDIMM exposed as a NUMA node,
memory allocations may end up on the NVDIMM node.  This may result in silent
performance degradation and regressions due to the different hardware
properties.

DRAM first should be obeyed to prevent surprising regressions.  Any
non-DRAM nodes should be excluded from default allocation.  Use a nodemask
to control the memory placement.  Introduce def_alloc_nodemask, which has
only DRAM nodes set.  Any non-DRAM allocation must be requested explicitly
via NUMA policy.

In the future we may be able to extract the memory characteristics from
HMAT or other sources to build up the default allocation nodemask.
However, just distinguish DRAM and PMEM (non-DRAM) nodes by the SRAT flag
for the time being.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 arch/x86/mm/numa.c     |  1 +
 drivers/acpi/numa.c    |  8 ++++++++
 include/linux/mmzone.h |  3 +++
 mm/page_alloc.c        | 18 ++++++++++++++++--
 4 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index dfb6c4d..d9e0ca4 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -626,6 +626,7 @@ static int __init numa_init(int (*init_func)(void))
 	nodes_clear(numa_nodes_parsed);
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
+	nodes_clear(def_alloc_nodemask);
 	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
 	WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
 				  MAX_NUMNODES));
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 867f6e3..79dfedf 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -296,6 +296,14 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
 		goto out_err_bad_srat;
 	}
 
+	/*
+	 * Non volatile memory is excluded from zonelist by default.
+	 * Only regular DRAM nodes are set in default allocation node
+	 * mask.
+	 */
+	if (!(ma->flags & ACPI_SRAT_MEM_NON_VOLATILE))
+		node_set(node, def_alloc_nodemask);
+
 	node_set(node, numa_nodes_parsed);
 
 	pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s%s\n",
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fba7741..063c3b4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -927,6 +927,9 @@ extern int numa_zonelist_order_handler(struct ctl_table *, int,
 extern struct pglist_data *next_online_pgdat(struct pglist_data *pgdat);
 extern struct zone *next_zone(struct zone *zone);
 
+/* Regular DRAM nodes */
+extern nodemask_t def_alloc_nodemask;
+
 /**
  * for_each_online_pgdat - helper macro to iterate over all online nodes
  * @pgdat - pointer to a pg_data_t variable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 03fcf73..68ad8c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -134,6 +134,8 @@ struct pcpu_drain {
 int percpu_pagelist_fraction;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
+nodemask_t def_alloc_nodemask __read_mostly;
+
 /*
  * A cached value of the page's pageblock's migratetype, used when the page is
  * put on a pcplist. Used to avoid the pageblock migratetype lookup when
@@ -4524,12 +4526,24 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 {
 	ac->high_zoneidx = gfp_zone(gfp_mask);
 	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
-	ac->nodemask = nodemask;
 	ac->migratetype = gfpflags_to_migratetype(gfp_mask);
 
+	if (!nodemask) {
+		/* Non-DRAM node is preferred node */
+		if (!node_isset(preferred_nid, def_alloc_nodemask))
+			/*
+			 * With MPOL_PREFERRED policy, once PMEM is allowed,
+			 * can falback to all memory nodes.
+			 */
+			ac->nodemask = &node_states[N_MEMORY];
+		else
+			ac->nodemask = &def_alloc_nodemask;
+	} else
+		ac->nodemask = nodemask;
+
 	if (cpusets_enabled()) {
 		*alloc_mask |= __GFP_HARDWALL;
-		if (!ac->nodemask)
+		if (nodes_equal(*ac->nodemask, def_alloc_nodemask))
 			ac->nodemask = &cpuset_current_mems_allowed;
 		else
 			*alloc_flags |= ALLOC_CPUSET;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 02/10] mm: mempolicy: introduce MPOL_HYBRID policy
  2019-03-23  4:44 [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Yang Shi
  2019-03-23  4:44 ` [PATCH 01/10] mm: control memory placement by nodemask for two tier main memory Yang Shi
@ 2019-03-23  4:44 ` Yang Shi
  2019-03-23  4:44 ` [PATCH 03/10] mm: mempolicy: promote page to DRAM for MPOL_HYBRID Yang Shi
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 66+ messages in thread
From: Yang Shi @ 2019-03-23  4:44 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang
  Cc: yang.shi, linux-mm, linux-kernel

Introduce a new NUMA policy, MPOL_HYBRID.  It behaves like MPOL_BIND, but
since we need to migrate pages from non-DRAM nodes (i.e. PMEM nodes) to
DRAM nodes on demand, MPOL_HYBRID does page migration on NUMA hinting
faults, so it has MPOL_F_MOF set by default.

The NUMA balancing part will be enabled in the following patch.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/uapi/linux/mempolicy.h |  1 +
 mm/mempolicy.c                 | 56 +++++++++++++++++++++++++++++++++++++-----
 2 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3354774..0fdc73d 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -22,6 +22,7 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
+	MPOL_HYBRID,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index af171cc..7d0a432 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -31,6 +31,10 @@
  *                but useful to set in a VMA when you have a non default
  *                process policy.
  *
+ * hybrid         Only allocate memory on specific set of nodes. If the set of
+ *                nodes include non-DRAM nodes, NUMA balancing would promote
+ *                the page to DRAM node.
+ *
  * default        Allocate on the local node first, or when on a VMA
  *                use the process policy. This is what Linux always did
  *		  in a NUMA aware kernel and still does by, ahem, default.
@@ -191,6 +195,17 @@ static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
 	return 0;
 }
 
+static int mpol_new_hybrid(struct mempolicy *pol, const nodemask_t *nodes)
+{
+	if (nodes_empty(*nodes))
+		return -EINVAL;
+
+	/* Hybrid policy would promote pages in page fault */
+	pol->flags |= MPOL_F_MOF;
+	pol->v.nodes = *nodes;
+	return 0;
+}
+
 /*
  * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if
  * any, for the new policy.  mpol_new() has already validated the nodes
@@ -401,6 +416,10 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 		.create = mpol_new_bind,
 		.rebind = mpol_rebind_nodemask,
 	},
+	[MPOL_HYBRID] = {
+		.create = mpol_new_hybrid,
+		.rebind = mpol_rebind_nodemask,
+	},
 };
 
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -782,6 +801,8 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 		return;
 
 	switch (p->mode) {
+	case MPOL_HYBRID:
+		/* Fall through */
 	case MPOL_BIND:
 		/* Fall through */
 	case MPOL_INTERLEAVE:
@@ -1721,8 +1742,12 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
  */
 static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 {
-	/* Lower zones don't get a nodemask applied for MPOL_BIND */
-	if (unlikely(policy->mode == MPOL_BIND) &&
+	/*
+	 * Lower zones don't get a nodemask applied for MPOL_BIND
+	 * or MPOL_HYBRID.
+	 */
+	if (unlikely((policy->mode == MPOL_BIND) ||
+			(policy->mode == MPOL_HYBRID)) &&
 			apply_policy_zone(policy, gfp_zone(gfp)) &&
 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
 		return &policy->v.nodes;
@@ -1742,7 +1767,9 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy,
 		 * because we might easily break the expectation to stay on the
 		 * requested node and not break the policy.
 		 */
-		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
+		WARN_ON_ONCE((policy->mode == MPOL_BIND ||
+			     policy->mode == MPOL_HYBRID) &&
+			     (gfp & __GFP_THISNODE));
 	}
 
 	return nd;
@@ -1786,6 +1813,8 @@ unsigned int mempolicy_slab_node(void)
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
+	case MPOL_HYBRID:
+		/* Fall through */
 	case MPOL_BIND: {
 		struct zoneref *z;
 
@@ -1856,7 +1885,8 @@ static inline unsigned interleave_nid(struct mempolicy *pol,
  * @addr: address in @vma for shared policy lookup and interleave policy
  * @gfp_flags: for requested zone
  * @mpol: pointer to mempolicy pointer for reference counted mempolicy
- * @nodemask: pointer to nodemask pointer for MPOL_BIND nodemask
+ * @nodemask: pointer to nodemask pointer for MPOL_BIND or MPOL_HYBRID
+ * nodemask
  *
  * Returns a nid suitable for a huge page allocation and a pointer
  * to the struct mempolicy for conditional unref after allocation.
@@ -1871,14 +1901,16 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
 	int nid;
 
 	*mpol = get_vma_policy(vma, addr);
-	*nodemask = NULL;	/* assume !MPOL_BIND */
+	/* assume !MPOL_BIND || !MPOL_HYBRID */
+	*nodemask = NULL;
 
 	if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
 		nid = interleave_nid(*mpol, vma, addr,
 					huge_page_shift(hstate_vma(vma)));
 	} else {
 		nid = policy_node(gfp_flags, *mpol, numa_node_id());
-		if ((*mpol)->mode == MPOL_BIND)
+		if ((*mpol)->mode == MPOL_BIND ||
+		    (*mpol)->mode == MPOL_HYBRID)
 			*nodemask = &(*mpol)->v.nodes;
 	}
 	return nid;
@@ -1919,6 +1951,8 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 		init_nodemask_of_node(mask, nid);
 		break;
 
+	case MPOL_HYBRID:
+		/* Fall through */
 	case MPOL_BIND:
 		/* Fall through */
 	case MPOL_INTERLEAVE:
@@ -1966,6 +2000,7 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
 		 * nodes in mask.
 		 */
 		break;
+	case MPOL_HYBRID:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		ret = nodes_intersects(mempolicy->v.nodes, *mask);
@@ -2170,6 +2205,8 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 			return false;
 
 	switch (a->mode) {
+	case MPOL_HYBRID:
+		/* Fall through */
 	case MPOL_BIND:
 		/* Fall through */
 	case MPOL_INTERLEAVE:
@@ -2325,6 +2362,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 			polnid = pol->v.preferred_node;
 		break;
 
+	case MPOL_HYBRID:
+		/* Fall through */
+
 	case MPOL_BIND:
 
 		/*
@@ -2693,6 +2733,7 @@ void numa_default_policy(void)
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
 	[MPOL_LOCAL]      = "local",
+	[MPOL_HYBRID]     = "hybrid",
 };
 
 
@@ -2768,6 +2809,8 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		if (!nodelist)
 			err = 0;
 		goto out;
+	case MPOL_HYBRID:
+		/* Fall through */
 	case MPOL_BIND:
 		/*
 		 * Insist on a nodelist
@@ -2856,6 +2899,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		else
 			node_set(pol->v.preferred_node, nodes);
 		break;
+	case MPOL_HYBRID:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		nodes = pol->v.nodes;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 03/10] mm: mempolicy: promote page to DRAM for MPOL_HYBRID
  2019-03-23  4:44 [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Yang Shi
  2019-03-23  4:44 ` [PATCH 01/10] mm: control memory placement by nodemask for two tier main memory Yang Shi
  2019-03-23  4:44 ` [PATCH 02/10] mm: mempolicy: introduce MPOL_HYBRID policy Yang Shi
@ 2019-03-23  4:44 ` Yang Shi
  2019-03-23  4:44 ` [PATCH 04/10] mm: numa: promote pages to DRAM when it is accessed twice Yang Shi
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 66+ messages in thread
From: Yang Shi @ 2019-03-23  4:44 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang
  Cc: yang.shi, linux-mm, linux-kernel

With MPOL_HYBRID the memory allocation may end up on a non-DRAM node, which
may not be optimal for performance.  Promote pages to DRAM with NUMA
balancing for MPOL_HYBRID.

If DRAM nodes are specified, migrate to the specified nodes.  If no DRAM
node is specified, migrate to the local DRAM node.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/mempolicy.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7d0a432..87bc691 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2339,6 +2339,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	struct zoneref *z;
 	int curnid = page_to_nid(page);
 	unsigned long pgoff;
+	nodemask_t nmask;
 	int thiscpu = raw_smp_processor_id();
 	int thisnid = cpu_to_node(thiscpu);
 	int polnid = NUMA_NO_NODE;
@@ -2363,7 +2364,24 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_HYBRID:
-		/* Fall through */
+		if (node_isset(curnid, pol->v.nodes) &&
+		    node_isset(curnid, def_alloc_nodemask))
+			/* The page is already on DRAM node */
+			goto out;
+
+		/*
+		 * Promote to the DRAM node specified by the policy, or
+		 * the local DRAM node if no DRAM node is specified.
+		 */
+		nodes_and(nmask, pol->v.nodes, def_alloc_nodemask);
+
+		z = first_zones_zonelist(
+			node_zonelist(numa_node_id(), GFP_HIGHUSER),
+			gfp_zone(GFP_HIGHUSER),
+			nodes_empty(nmask) ? &def_alloc_nodemask : &nmask);
+		polnid = z->zone->node;
+
+		break;
 
 	case MPOL_BIND:
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 04/10] mm: numa: promote pages to DRAM when it is accessed twice
  2019-03-23  4:44 [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Yang Shi
                   ` (2 preceding siblings ...)
  2019-03-23  4:44 ` [PATCH 03/10] mm: mempolicy: promote page to DRAM for MPOL_HYBRID Yang Shi
@ 2019-03-23  4:44 ` Yang Shi
  2019-03-29  0:31   ` kbuild test robot
  2019-03-23  4:44 ` [PATCH 05/10] mm: page_alloc: make find_next_best_node could skip DRAM node Yang Shi
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 66+ messages in thread
From: Yang Shi @ 2019-03-23  4:44 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang
  Cc: yang.shi, linux-mm, linux-kernel

NUMA balancing would promote a page to DRAM as soon as it is accessed, but
that might be just a one-off access.  To reduce migration thrashing and
memory bandwidth pressure, introduce the PG_promote flag to mark promotion
candidates.  A page will be promoted to DRAM once it has been accessed
twice.  This should be a good way to filter out one-off accesses.

The PG_promote flag is inherited by tail pages when a THP gets split.
But, it is not copied to the new page once the migration is done.

This approach is definitely not the optimal one for distinguishing hot
from cold pages.  Doing so accurately may need a much more sophisticated
algorithm.  The kernel may not be the right place to implement such an
algorithm considering the complexity and potential overhead, but it may
still need this basic capability.

With NUMA balancing the whole working set of the process may eventually
end up promoted to DRAM.  This relies on page reclaim to demote inactive
pages to PMEM, implemented by a following patch.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/page-flags.h     |  4 ++++
 include/trace/events/mmflags.h |  3 ++-
 mm/huge_memory.c               | 10 ++++++++++
 mm/memory.c                    |  8 ++++++++
 4 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a..2d53166 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -131,6 +131,7 @@ enum pageflags {
 	PG_young,
 	PG_idle,
 #endif
+	PG_promote,		/* Promote candidate for NUMA balancing */
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -348,6 +349,9 @@ static inline void page_init_poison(struct page *page, size_t size)
 PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
 	TESTCLEARFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
 
+PAGEFLAG(Promote, promote, PF_ANY) __SETPAGEFLAG(Promote, promote, PF_ANY)
+	__CLEARPAGEFLAG(Promote, promote, PF_ANY)
+
 /*
  * Only test-and-set exist for PG_writeback.  The unconditional operators are
  * risky: they bypass page accounting.
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a1675d4..f13c2a1 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -100,7 +100,8 @@
 	{1UL << PG_mappedtodisk,	"mappedtodisk"	},		\
 	{1UL << PG_reclaim,		"reclaim"	},		\
 	{1UL << PG_swapbacked,		"swapbacked"	},		\
-	{1UL << PG_unevictable,		"unevictable"	}		\
+	{1UL << PG_unevictable,		"unevictable"	},		\
+	{1UL << PG_promote,		"promote"	}		\
 IF_HAVE_PG_MLOCK(PG_mlocked,		"mlocked"	)		\
 IF_HAVE_PG_UNCACHED(PG_uncached,	"uncached"	)		\
 IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	)		\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 404acdc..8268a3c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1589,6 +1589,15 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 					      haddr + HPAGE_PMD_SIZE);
 	}
 
+	/* Promote page to DRAM when referenced twice */
+	if (!(node_isset(page_nid, def_alloc_nodemask)) &&
+	    !PagePromote(page)) {
+		SetPagePromote(page);
+		put_page(page);
+		page_nid = -1;
+		goto clear_pmdnuma;
+	}
+
 	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and access rights restored.
@@ -2396,6 +2405,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_workingset) |
 			 (1L << PG_locked) |
 			 (1L << PG_unevictable) |
+			 (1L << PG_promote) |
 			 (1L << PG_dirty)));
 
 	/* ->mapping in first tail page is compound_mapcount */
diff --git a/mm/memory.c b/mm/memory.c
index 47fe250..2494c11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3680,6 +3680,14 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		goto out;
 	}
 
+	/* Promote the non-DRAM page when it is referenced twice */
+	if (!(node_isset(page_nid, def_alloc_nodemask)) &&
+	    !PagePromote(page)) {
+		SetPagePromote(page);
+		put_page(page);
+		goto out;
+	}
+
 	/* Migrate to the requested node */
 	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated) {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 05/10] mm: page_alloc: make find_next_best_node could skip DRAM node
  2019-03-23  4:44 [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Yang Shi
                   ` (3 preceding siblings ...)
  2019-03-23  4:44 ` [PATCH 04/10] mm: numa: promote pages to DRAM when it is accessed twice Yang Shi
@ 2019-03-23  4:44 ` Yang Shi
  2019-03-23  4:44 ` [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node Yang Shi
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 66+ messages in thread
From: Yang Shi @ 2019-03-23  4:44 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang
  Cc: yang.shi, linux-mm, linux-kernel

We need to find the closest non-DRAM node to demote DRAM pages to.  Add a
"skip_ram_node" parameter to find_next_best_node() to skip DRAM nodes on
demand.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/internal.h   | 11 +++++++++++
 mm/page_alloc.c | 15 +++++++++++----
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 9eeaf2b..46ad0d8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -292,6 +292,17 @@ static inline bool is_data_mapping(vm_flags_t flags)
 	return (flags & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE;
 }
 
+#ifdef CONFIG_NUMA
+extern int find_next_best_node(int node, nodemask_t *used_node_mask,
+			       bool skip_ram_node);
+#else
+static inline int find_next_best_node(int node, nodemask_t *used_node_mask,
+				      bool skip_ram_node)
+{
+	return 0;
+}
+#endif
+
 /* mm/util.c */
 void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct vm_area_struct *prev, struct rb_node *rb_parent);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 68ad8c6..07d767b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5375,6 +5375,7 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
  * find_next_best_node - find the next node that should appear in a given node's fallback list
  * @node: node whose fallback list we're appending
  * @used_node_mask: nodemask_t of already used nodes
+ * @skip_ram_node: find next best non-DRAM node
  *
  * We use a number of factors to determine which is the next node that should
  * appear on a given node's fallback list.  The node should not have appeared
@@ -5386,7 +5387,8 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
  *
  * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
  */
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+int find_next_best_node(int node, nodemask_t *used_node_mask,
+			bool skip_ram_node)
 {
 	int n, val;
 	int min_val = INT_MAX;
@@ -5394,13 +5396,19 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 	const struct cpumask *tmp = cpumask_of_node(0);
 
 	/* Use the local node if we haven't already */
-	if (!node_isset(node, *used_node_mask)) {
+	if (!node_isset(node, *used_node_mask) &&
+	    !skip_ram_node) {
 		node_set(node, *used_node_mask);
 		return node;
 	}
 
 	for_each_node_state(n, N_MEMORY) {
 
+		/* Find next best non-DRAM node */
+		if (skip_ram_node &&
+		    (node_isset(n, def_alloc_nodemask)))
+			continue;
+
 		/* Don't want a node to appear more than once */
 		if (node_isset(n, *used_node_mask))
 			continue;
@@ -5432,7 +5440,6 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 	return best_node;
 }
 
-
 /*
  * Build zonelists ordered by node and zones within node.
  * This results in maximum locality--normal zone overflows into local
@@ -5494,7 +5501,7 @@ static void build_zonelists(pg_data_t *pgdat)
 	nodes_clear(used_mask);
 
 	memset(node_order, 0, sizeof(node_order));
-	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+	while ((node = find_next_best_node(local_node, &used_mask, false)) >= 0) {
 		/*
 		 * We don't want to pressure a particular node.
 		 * So adding penalty to the first node in same
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-23  4:44 [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Yang Shi
                   ` (4 preceding siblings ...)
  2019-03-23  4:44 ` [PATCH 05/10] mm: page_alloc: make find_next_best_node could skip DRAM node Yang Shi
@ 2019-03-23  4:44 ` Yang Shi
  2019-03-23  6:03   ` Zi Yan
  2019-03-24 22:20   ` Keith Busch
  2019-03-23  4:44 ` [PATCH 07/10] mm: vmscan: add page demotion counter Yang Shi
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 66+ messages in thread
From: Yang Shi @ 2019-03-23  4:44 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang
  Cc: yang.shi, linux-mm, linux-kernel

Since PMEM provides larger capacity than DRAM and has much lower access
latency than disk, it is a good choice to use as a middle tier between
DRAM and disk in the page reclaim path.

With PMEM nodes, the demotion path of anonymous pages could be:

DRAM -> PMEM -> swap device

This patch demotes anonymous pages only for the time being and demotes
THPs to PMEM as a whole.  This may cause expensive page reclaim and/or
compaction on the PMEM node if it comes under memory pressure.  But,
considering the capacity of PMEM and that allocation only happens on PMEM
when it is specified explicitly, such cases should not be that frequent.
So, it seems worth keeping a THP whole instead of splitting it.

Demote pages to the closest non-DRAM node even if the system is swapless.
The current page reclaim logic only scans the anon LRU when swap is on
and swappiness is set properly.  Demoting to PMEM doesn't need to care
whether swap is available or not.  But, reclaiming from PMEM still skips
the anon LRU if swap is not available.

The demotion just happens between a DRAM node and its closest PMEM node.
Demoting to a remote PMEM node is not allowed for now.

And, define a new migration reason for demotion, called MR_DEMOTE.
Demote pages via async migration to avoid blocking.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/migrate.h        |  1 +
 include/trace/events/migrate.h |  3 +-
 mm/debug.c                     |  1 +
 mm/internal.h                  | 22 ++++++++++
 mm/vmscan.c                    | 99 ++++++++++++++++++++++++++++++++++--------
 5 files changed, 107 insertions(+), 19 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e13d9bf..78c8dda 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -25,6 +25,7 @@ enum migrate_reason {
 	MR_MEMPOLICY_MBIND,
 	MR_NUMA_MISPLACED,
 	MR_CONTIG_RANGE,
+	MR_DEMOTE,
 	MR_TYPES
 };
 
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 705b33d..c1d5b36 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@
 	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
 	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
 	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
-	EMe(MR_CONTIG_RANGE,	"contig_range")
+	EM( MR_CONTIG_RANGE,	"contig_range")			\
+	EMe(MR_DEMOTE,		"demote")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff --git a/mm/debug.c b/mm/debug.c
index c0b31b6..cc0d7df 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -25,6 +25,7 @@
 	"mempolicy_mbind",
 	"numa_misplaced",
 	"cma",
+	"demote",
 };
 
 const struct trace_print_flags pageflag_names[] = {
diff --git a/mm/internal.h b/mm/internal.h
index 46ad0d8..0152300 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -303,6 +303,19 @@ static inline int find_next_best_node(int node, nodemask_t *used_node_mask,
 }
 #endif
 
+static inline bool has_nonram_online(void)
+{
+	int i = 0;
+
+	for_each_online_node(i) {
+		/* Have PMEM node online? */
+		if (!node_isset(i, def_alloc_nodemask))
+			return true;
+	}
+
+	return false;
+}
+
 /* mm/util.c */
 void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct vm_area_struct *prev, struct rb_node *rb_parent);
@@ -565,5 +578,14 @@ static inline bool is_migrate_highatomic_page(struct page *page)
 }
 
 void setup_zone_pageset(struct zone *zone);
+
+#ifdef CONFIG_NUMA
 extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
+#else
+static inline struct page *alloc_new_node_page(struct page *page,
+					       unsigned long node)
+{
+	return NULL;
+}
+#endif
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a5ad0b3..bdcab6b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1094,6 +1094,19 @@ static void page_check_dirty_writeback(struct page *page,
 		mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
 }
 
+static inline bool is_demote_ok(struct pglist_data *pgdat)
+{
+	/* Current node is not DRAM node */
+	if (!node_isset(pgdat->node_id, def_alloc_nodemask))
+		return false;
+
+	/* No online PMEM node */
+	if (!has_nonram_online())
+		return false;
+
+	return true;
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -1106,6 +1119,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(demote_pages);
 	unsigned nr_reclaimed = 0;
 
 	memset(stat, 0, sizeof(*stat));
@@ -1262,6 +1276,22 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		/*
+		 * Demote DRAM pages regardless the mempolicy.
+		 * Demot anonymous pages only for now and skip MADV_FREE
+		 * pages.
+		 */
+		if (PageAnon(page) && !PageSwapCache(page) &&
+		    (node_isset(page_to_nid(page), def_alloc_nodemask)) &&
+		    PageSwapBacked(page)) {
+
+			if (has_nonram_online()) {
+				list_add(&page->lru, &demote_pages);
+				unlock_page(page);
+				continue;
+			}
+		}
+
+		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
 		 * Lazyfree page could be freed directly
@@ -1477,6 +1507,25 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
 
+	/* Demote pages to PMEM */
+	if (!list_empty(&demote_pages)) {
+		int err, target_nid;
+		nodemask_t used_mask;
+
+		nodes_clear(used_mask);
+		target_nid = find_next_best_node(pgdat->node_id, &used_mask,
+						 true);
+
+		err = migrate_pages(&demote_pages, alloc_new_node_page, NULL,
+				    target_nid, MIGRATE_ASYNC, MR_DEMOTE);
+
+		if (err) {
+			putback_movable_pages(&demote_pages);
+
+			list_splice(&ret_pages, &demote_pages);
+		}
+	}
+
 	mem_cgroup_uncharge_list(&free_pages);
 	try_to_unmap_flush();
 	free_unref_page_list(&free_pages);
@@ -2188,10 +2237,11 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	unsigned long gb;
 
 	/*
-	 * If we don't have swap space, anonymous page deactivation
-	 * is pointless.
+	 * If we don't have swap space or PMEM online, anonymous page
+	 * deactivation is pointless.
 	 */
-	if (!file && !total_swap_pages)
+	if (!file && !total_swap_pages &&
+	    !is_demote_ok(pgdat))
 		return false;
 
 	inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
@@ -2271,22 +2321,34 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	unsigned long ap, fp;
 	enum lru_list lru;
 
-	/* If we have no swap space, do not bother scanning anon pages. */
-	if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
-		scan_balance = SCAN_FILE;
-		goto out;
-	}
-
 	/*
-	 * Global reclaim will swap to prevent OOM even with no
-	 * swappiness, but memcg users want to use this knob to
-	 * disable swapping for individual groups completely when
-	 * using the memory controller's swap limit feature would be
-	 * too expensive.
+	 * Anon pages can be demoted to PMEM. If there is PMEM node online,
+	 * still scan anonymous LRU even though the systme is swapless or
+	 * swapping is disabled by memcg.
+	 *
+	 * If current node is already PMEM node, demotion is not applicable.
 	 */
-	if (!global_reclaim(sc) && !swappiness) {
-		scan_balance = SCAN_FILE;
-		goto out;
+	if (!is_demote_ok(pgdat)) {
+		/*
+		 * If we have no swap space, do not bother scanning
+		 * anon pages.
+		 */
+		if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
+			scan_balance = SCAN_FILE;
+			goto out;
+		}
+
+		/*
+		 * Global reclaim will swap to prevent OOM even with no
+		 * swappiness, but memcg users want to use this knob to
+		 * disable swapping for individual groups completely when
+		 * using the memory controller's swap limit feature would be
+		 * too expensive.
+		 */
+		if (!global_reclaim(sc) && !swappiness) {
+			scan_balance = SCAN_FILE;
+			goto out;
+		}
 	}
 
 	/*
@@ -3332,7 +3394,8 @@ static void age_active_anon(struct pglist_data *pgdat,
 {
 	struct mem_cgroup *memcg;
 
-	if (!total_swap_pages)
+	/* Aging anon page as long as demotion is fine */
+	if (!total_swap_pages && !is_demote_ok(pgdat))
 		return;
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 07/10] mm: vmscan: add page demotion counter
  2019-03-23  4:44 [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Yang Shi
                   ` (5 preceding siblings ...)
  2019-03-23  4:44 ` [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node Yang Shi
@ 2019-03-23  4:44 ` Yang Shi
  2019-03-23  4:44 ` [PATCH 08/10] mm: numa: add page promotion counter Yang Shi
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 66+ messages in thread
From: Yang Shi @ 2019-03-23  4:44 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang
  Cc: yang.shi, linux-mm, linux-kernel

Demoted pages are counted in reclaim_stat->nr_demoted instead of
nr_reclaimed since they are not actually reclaimed.  They are still in
memory, just migrated to PMEM.

Add pgdemote_kswapd and pgdemote_direct VM counters, shown in
/proc/vmstat.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/vm_event_item.h |  2 ++
 include/linux/vmstat.h        |  1 +
 mm/vmscan.c                   | 14 ++++++++++++++
 mm/vmstat.c                   |  2 ++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441..499a3aa 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -32,6 +32,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGREFILL,
 		PGSTEAL_KSWAPD,
 		PGSTEAL_DIRECT,
+		PGDEMOTE_KSWAPD,
+		PGDEMOTE_DIRECT,
 		PGSCAN_KSWAPD,
 		PGSCAN_DIRECT,
 		PGSCAN_DIRECT_THROTTLE,
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 2db8d60..eb5d21c 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -29,6 +29,7 @@ struct reclaim_stat {
 	unsigned nr_activate;
 	unsigned nr_ref_keep;
 	unsigned nr_unmap_fail;
+	unsigned nr_demoted;
 };
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bdcab6b..3c7ba7e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1286,6 +1286,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 			if (has_nonram_online()) {
 				list_add(&page->lru, &demote_pages);
+				if (PageTransHuge(page))
+					stat->nr_demoted += HPAGE_PMD_NR;
+				else
+					stat->nr_demoted++;
 				unlock_page(page);
 				continue;
 			}
@@ -1523,7 +1527,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			putback_movable_pages(&demote_pages);
 
 			list_splice(&ret_pages, &demote_pages);
+
+			if (err > 0)
+				stat->nr_demoted -= err;
+			else
+				stat->nr_demoted = 0;
 		}
+
+		if (current_is_kswapd())
+			__count_vm_events(PGDEMOTE_KSWAPD, stat->nr_demoted);
+		else
+			__count_vm_events(PGDEMOTE_DIRECT, stat->nr_demoted);
 	}
 
 	mem_cgroup_uncharge_list(&free_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 36b56f8..0e863e7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1192,6 +1192,8 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 	"pgrefill",
 	"pgsteal_kswapd",
 	"pgsteal_direct",
+	"pgdemote_kswapd",
+	"pgdemote_direct",
 	"pgscan_kswapd",
 	"pgscan_direct",
 	"pgscan_direct_throttle",
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 08/10] mm: numa: add page promotion counter
  2019-03-23  4:44 [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Yang Shi
                   ` (6 preceding siblings ...)
  2019-03-23  4:44 ` [PATCH 07/10] mm: vmscan: add page demotion counter Yang Shi
@ 2019-03-23  4:44 ` Yang Shi
  2019-03-23  4:44 ` [PATCH 09/10] doc: add description for MPOL_HYBRID mode Yang Shi
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 66+ messages in thread
From: Yang Shi @ 2019-03-23  4:44 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang
  Cc: yang.shi, linux-mm, linux-kernel

Add a counter for page promotions done by NUMA balancing.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/vm_event_item.h | 1 +
 mm/huge_memory.c              | 4 ++++
 mm/memory.c                   | 4 ++++
 mm/vmstat.c                   | 1 +
 4 files changed, 10 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 499a3aa..9f52a62 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -51,6 +51,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NUMA_HINT_FAULTS,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
+		NUMA_PAGE_PROMOTE,
 #endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8268a3c..9d5f5ce 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1607,6 +1607,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
 				vmf->pmd, pmd, vmf->address, page, target_nid);
 	if (migrated) {
+		if (!node_isset(page_nid, def_alloc_nodemask) &&
+		    node_isset(target_nid, def_alloc_nodemask))
+			count_vm_numa_events(NUMA_PAGE_PROMOTE, HPAGE_PMD_NR);
+
 		flags |= TNF_MIGRATED;
 		page_nid = target_nid;
 	} else
diff --git a/mm/memory.c b/mm/memory.c
index 2494c11..554191b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3691,6 +3691,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	/* Migrate to the requested node */
 	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated) {
+		if (!node_isset(page_nid, def_alloc_nodemask) &&
+		    node_isset(target_nid, def_alloc_nodemask))
+			count_vm_numa_event(NUMA_PAGE_PROMOTE);
+
 		page_nid = target_nid;
 		flags |= TNF_MIGRATED;
 	} else
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0e863e7..4b44fc8 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1220,6 +1220,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 	"numa_hint_faults",
 	"numa_hint_faults_local",
 	"numa_pages_migrated",
+	"numa_pages_promoted",
 #endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 09/10] doc: add description for MPOL_HYBRID mode
  2019-03-23  4:44 [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Yang Shi
                   ` (7 preceding siblings ...)
  2019-03-23  4:44 ` [PATCH 08/10] mm: numa: add page promotion counter Yang Shi
@ 2019-03-23  4:44 ` Yang Shi
  2019-03-23  4:44 ` [PATCH 10/10] doc: elaborate the PMEM allocation rule Yang Shi
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 66+ messages in thread
From: Yang Shi @ 2019-03-23  4:44 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang
  Cc: yang.shi, linux-mm, linux-kernel

Add a description of the MPOL_HYBRID mode to the kernel documentation.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 Documentation/admin-guide/mm/numa_memory_policy.rst | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index d78c5b3..3db8257 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -198,6 +198,16 @@ MPOL_BIND
 	the node in the set with sufficient free memory that is
 	closest to the node where the allocation takes place.
 
+MPOL_HYBRID
+        This mode specifies that the page allocation must happen on the
+        nodes specified by the policy.  If both DRAM and non-DRAM nodes
+        are specified, NUMA balancing may promote the pages from non-DRAM
+        nodes to the specified DRAM nodes.  If only non-DRAM nodes are
+        specified, NUMA balancing may promote the pages to any available
+        DRAM nodes.  Any other policy doesn't do such page promotion.  The
+        default mode may do NUMA balancing, but non-DRAM nodes are masked
+        off for default mode.
+
 MPOL_PREFERRED
 	This mode specifies that the allocation should be attempted
 	from the single node specified in the policy.  If that
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 10/10] doc: elaborate the PMEM allocation rule
  2019-03-23  4:44 [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Yang Shi
                   ` (8 preceding siblings ...)
  2019-03-23  4:44 ` [PATCH 09/10] doc: add description for MPOL_HYBRID mode Yang Shi
@ 2019-03-23  4:44 ` Yang Shi
  2019-03-25 16:15 ` [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Brice Goglin
  2019-03-26 13:58 ` Michal Hocko
  11 siblings, 0 replies; 66+ messages in thread
From: Yang Shi @ 2019-03-23  4:44 UTC (permalink / raw)
  To: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang
  Cc: yang.shi, linux-mm, linux-kernel

Non-DRAM nodes are excluded from the default allocation node mask;
elaborate on the rules in the NUMA documentation.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 Documentation/vm/numa.rst | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/Documentation/vm/numa.rst b/Documentation/vm/numa.rst
index 185d8a5..8c2fd5c 100644
--- a/Documentation/vm/numa.rst
+++ b/Documentation/vm/numa.rst
@@ -133,7 +133,7 @@ a subsystem allocates per CPU memory resources, for example.
 
 A typical model for making such an allocation is to obtain the node id of the
 node to which the "current CPU" is attached using one of the kernel's
-numa_node_id() or CPU_to_node() functions and then request memory from only
+numa_node_id() or cpu_to_node() functions and then request memory from only
 the node id returned.  When such an allocation fails, the requesting subsystem
 may revert to its own fallback path.  The slab kernel memory allocator is an
 example of this.  Or, the subsystem may choose to disable or not to enable
@@ -148,3 +148,8 @@ architectures transparently, kernel subsystems can use the numa_mem_id()
 or cpu_to_mem() function to locate the "local memory node" for the calling or
 specified CPU.  Again, this is the same node from which default, local page
 allocations will be attempted.
+
+If the architecture supports non-regular DRAM nodes, i.e. NVDIMM on x86, the
+non-DRAM nodes are hidden from default mode, IOWs the default allocation
+would not end up on non-DRAM nodes, unless thoes nodes are specified
+explicity by mempolicy. [see Documentation/vm/numa_memory_policy.txt.]
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-23  4:44 ` [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node Yang Shi
@ 2019-03-23  6:03   ` Zi Yan
  2019-03-25 21:49     ` Yang Shi
  2019-03-24 22:20   ` Keith Busch
  1 sibling, 1 reply; 66+ messages in thread
From: Zi Yan @ 2019-03-23  6:03 UTC (permalink / raw)
  To: Yang Shi
  Cc: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, linux-mm,
	linux-kernel

On 22 Mar 2019, at 21:44, Yang Shi wrote:

> Since PMEM provides larger capacity than DRAM and has much lower
> access latency than disk, so it is a good choice to use as a middle
> tier between DRAM and disk in page reclaim path.
>
> With PMEM nodes, the demotion path of anonymous pages could be:
>
> DRAM -> PMEM -> swap device
>
> This patch demotes anonymous pages only for the time being and demote
> THP to PMEM in a whole.  However this may cause expensive page reclaim
> and/or compaction on PMEM node if there is memory pressure on it.  But,
> considering the capacity of PMEM and allocation only happens on PMEM
> when PMEM is specified explicity, such cases should be not that often.
> So, it sounds worth keeping THP in a whole instead of splitting it.
>
> Demote pages to the cloest non-DRAM node even though the system is
> swapless.  The current logic of page reclaim just scan anon LRU when
> swap is on and swappiness is set properly.  Demoting to PMEM doesn't
> need care whether swap is available or not.  But, reclaiming from PMEM
> still skip anon LRU is swap is not available.
>
> The demotion just happens between DRAM node and its cloest PMEM node.
> Demoting to a remote PMEM node is not allowed for now.
>
> And, define a new migration reason for demotion, called MR_DEMOTE.
> Demote page via async migration to avoid blocking.
>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
>  include/linux/migrate.h        |  1 +
>  include/trace/events/migrate.h |  3 +-
>  mm/debug.c                     |  1 +
>  mm/internal.h                  | 22 ++++++++++
>  mm/vmscan.c                    | 99 ++++++++++++++++++++++++++++++++++--------
>  5 files changed, 107 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index e13d9bf..78c8dda 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -25,6 +25,7 @@ enum migrate_reason {
>  	MR_MEMPOLICY_MBIND,
>  	MR_NUMA_MISPLACED,
>  	MR_CONTIG_RANGE,
> +	MR_DEMOTE,
>  	MR_TYPES
>  };
>
> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> index 705b33d..c1d5b36 100644
> --- a/include/trace/events/migrate.h
> +++ b/include/trace/events/migrate.h
> @@ -20,7 +20,8 @@
>  	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
>  	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
>  	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
> -	EMe(MR_CONTIG_RANGE,	"contig_range")
> +	EM( MR_CONTIG_RANGE,	"contig_range")			\
> +	EMe(MR_DEMOTE,		"demote")
>
>  /*
>   * First define the enums in the above macros to be exported to userspace
> diff --git a/mm/debug.c b/mm/debug.c
> index c0b31b6..cc0d7df 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -25,6 +25,7 @@
>  	"mempolicy_mbind",
>  	"numa_misplaced",
>  	"cma",
> +	"demote",
>  };
>
>  const struct trace_print_flags pageflag_names[] = {
> diff --git a/mm/internal.h b/mm/internal.h
> index 46ad0d8..0152300 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -303,6 +303,19 @@ static inline int find_next_best_node(int node, nodemask_t *used_node_mask,
>  }
>  #endif
>
> +static inline bool has_nonram_online(void)
> +{
> +	int i = 0;
> +
> +	for_each_online_node(i) {
> +		/* Have PMEM node online? */
> +		if (!node_isset(i, def_alloc_nodemask))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>  /* mm/util.c */
>  void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
>  		struct vm_area_struct *prev, struct rb_node *rb_parent);
> @@ -565,5 +578,14 @@ static inline bool is_migrate_highatomic_page(struct page *page)
>  }
>
>  void setup_zone_pageset(struct zone *zone);
> +
> +#ifdef CONFIG_NUMA
>  extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
> +#else
> +static inline struct page *alloc_new_node_page(struct page *page,
> +					       unsigned long node)
> +{
> +	return NULL;
> +}
> +#endif
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a5ad0b3..bdcab6b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1094,6 +1094,19 @@ static void page_check_dirty_writeback(struct page *page,
>  		mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
>  }
>
> +static inline bool is_demote_ok(struct pglist_data *pgdat)
> +{
> +	/* Current node is not DRAM node */
> +	if (!node_isset(pgdat->node_id, def_alloc_nodemask))
> +		return false;
> +
> +	/* No online PMEM node */
> +	if (!has_nonram_online())
> +		return false;
> +
> +	return true;
> +}
> +
>  /*
>   * shrink_page_list() returns the number of reclaimed pages
>   */
> @@ -1106,6 +1119,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  {
>  	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
> +	LIST_HEAD(demote_pages);
>  	unsigned nr_reclaimed = 0;
>
>  	memset(stat, 0, sizeof(*stat));
> @@ -1262,6 +1276,22 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		}
>
>  		/*
> +		 * Demote DRAM pages regardless the mempolicy.
> +		 * Demot anonymous pages only for now and skip MADV_FREE

s/Demot/Demote

> +		 * pages.
> +		 */
> +		if (PageAnon(page) && !PageSwapCache(page) &&
> +		    (node_isset(page_to_nid(page), def_alloc_nodemask)) &&
> +		    PageSwapBacked(page)) {
> +
> +			if (has_nonram_online()) {
> +				list_add(&page->lru, &demote_pages);
> +				unlock_page(page);
> +				continue;
> +			}
> +		}
> +
> +		/*
>  		 * Anonymous process memory has backing store?
>  		 * Try to allocate it some swap space here.
>  		 * Lazyfree page could be freed directly
> @@ -1477,6 +1507,25 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
>  	}
>
> +	/* Demote pages to PMEM */
> +	if (!list_empty(&demote_pages)) {
> +		int err, target_nid;
> +		nodemask_t used_mask;
> +
> +		nodes_clear(used_mask);
> +		target_nid = find_next_best_node(pgdat->node_id, &used_mask,
> +						 true);
> +
> +		err = migrate_pages(&demote_pages, alloc_new_node_page, NULL,
> +				    target_nid, MIGRATE_ASYNC, MR_DEMOTE);
> +
> +		if (err) {
> +			putback_movable_pages(&demote_pages);
> +
> +			list_splice(&ret_pages, &demote_pages);
> +		}
> +	}
> +

I like your approach here. It reuses the existing migrate_pages() interface without
adding extra code. I also would like to be CC’d in your future versions.

Thank you.

--
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/10] mm: control memory placement by nodemask for two tier main memory
  2019-03-23  4:44 ` [PATCH 01/10] mm: control memory placement by nodemask for two tier main memory Yang Shi
@ 2019-03-23 17:21     ` Dan Williams
  0 siblings, 0 replies; 66+ messages in thread
From: Dan Williams @ 2019-03-23 17:21 UTC (permalink / raw)
  To: Yang Shi
  Cc: Michal Hocko, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List

On Fri, Mar 22, 2019 at 9:45 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
> When running applications on the machine with NVDIMM as NUMA node, the
> memory allocation may end up on NVDIMM node.  This may result in silent
> performance degradation and regression due to the difference of hardware
> property.
>
> DRAM first should be obeyed to prevent from surprising regression.  Any
> non-DRAM nodes should be excluded from default allocation.  Use nodemask
> to control the memory placement.  Introduce def_alloc_nodemask which has
> DRAM nodes set only.  Any non-DRAM allocation should be specified by
> NUMA policy explicitly.
>
> In the future we may be able to extract the memory charasteristics from
> HMAT or other source to build up the default allocation nodemask.
> However, just distinguish DRAM and PMEM (non-DRAM) nodes by SRAT flag
> for the time being.
>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
>  arch/x86/mm/numa.c     |  1 +
>  drivers/acpi/numa.c    |  8 ++++++++
>  include/linux/mmzone.h |  3 +++
>  mm/page_alloc.c        | 18 ++++++++++++++++--
>  4 files changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index dfb6c4d..d9e0ca4 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -626,6 +626,7 @@ static int __init numa_init(int (*init_func)(void))
>         nodes_clear(numa_nodes_parsed);
>         nodes_clear(node_possible_map);
>         nodes_clear(node_online_map);
> +       nodes_clear(def_alloc_nodemask);
>         memset(&numa_meminfo, 0, sizeof(numa_meminfo));
>         WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
>                                   MAX_NUMNODES));
> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> index 867f6e3..79dfedf 100644
> --- a/drivers/acpi/numa.c
> +++ b/drivers/acpi/numa.c
> @@ -296,6 +296,14 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
>                 goto out_err_bad_srat;
>         }
>
> +       /*
> +        * Non volatile memory is excluded from zonelist by default.
> +        * Only regular DRAM nodes are set in default allocation node
> +        * mask.
> +        */
> +       if (!(ma->flags & ACPI_SRAT_MEM_NON_VOLATILE))
> +               node_set(node, def_alloc_nodemask);

Hmm, no, I don't think we should do this. Especially considering
current generation NVDIMMs are energy backed DRAM there is no
performance difference that should be assumed by the non-volatile
flag.

Why isn't default SLIT distance sufficient for ensuring a DRAM-first
default policy?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-23  4:44 ` [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node Yang Shi
  2019-03-23  6:03   ` Zi Yan
@ 2019-03-24 22:20   ` Keith Busch
  2019-03-25 19:49     ` Yang Shi
  1 sibling, 1 reply; 66+ messages in thread
From: Keith Busch @ 2019-03-24 22:20 UTC (permalink / raw)
  To: Yang Shi
  Cc: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, linux-mm,
	linux-kernel

On Sat, Mar 23, 2019 at 12:44:31PM +0800, Yang Shi wrote:
>  		/*
> +		 * Demote DRAM pages regardless the mempolicy.
> +		 * Demot anonymous pages only for now and skip MADV_FREE
> +		 * pages.
> +		 */
> +		if (PageAnon(page) && !PageSwapCache(page) &&
> +		    (node_isset(page_to_nid(page), def_alloc_nodemask)) &&
> +		    PageSwapBacked(page)) {
> +
> +			if (has_nonram_online()) {
> +				list_add(&page->lru, &demote_pages);
> +				unlock_page(page);
> +				continue;
> +			}
> +		}
> +
> +		/*
>  		 * Anonymous process memory has backing store?
>  		 * Try to allocate it some swap space here.
>  		 * Lazyfree page could be freed directly
> @@ -1477,6 +1507,25 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
>  	}
>  
> +	/* Demote pages to PMEM */
> +	if (!list_empty(&demote_pages)) {
> +		int err, target_nid;
> +		nodemask_t used_mask;
> +
> +		nodes_clear(used_mask);
> +		target_nid = find_next_best_node(pgdat->node_id, &used_mask,
> +						 true);
> +
> +		err = migrate_pages(&demote_pages, alloc_new_node_page, NULL,
> +				    target_nid, MIGRATE_ASYNC, MR_DEMOTE);
> +
> +		if (err) {
> +			putback_movable_pages(&demote_pages);
> +
> +			list_splice(&ret_pages, &demote_pages);
> +		}
> +	}
> +
>  	mem_cgroup_uncharge_list(&free_pages);
>  	try_to_unmap_flush();
>  	free_unref_page_list(&free_pages);

How do these pages eventually get to swap when migration fails? Looks
like that's skipped.

And page cache demotion is useful too, we shouldn't consider only
anonymous for this feature.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-23  4:44 [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Yang Shi
                   ` (9 preceding siblings ...)
  2019-03-23  4:44 ` [PATCH 10/10] doc: elaborate the PMEM allocation rule Yang Shi
@ 2019-03-25 16:15 ` Brice Goglin
  2019-03-25 16:56     ` Dan Williams
  2019-03-25 20:04   ` Yang Shi
  2019-03-26 13:58 ` Michal Hocko
  11 siblings, 2 replies; 66+ messages in thread
From: Brice Goglin @ 2019-03-25 16:15 UTC (permalink / raw)
  To: Yang Shi, mhocko, mgorman, riel, hannes, akpm, dave.hansen,
	keith.busch, dan.j.williams, fengguang.wu, fan.du, ying.huang
  Cc: linux-mm, linux-kernel


Le 23/03/2019 à 05:44, Yang Shi a écrit :
> With Dave Hansen's patches merged into Linus's tree
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>
> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> effectively and efficiently is still a question. 
>
> There have been a couple of proposals posted on the mailing list [1] [2].
>
> The patchset is aimed to try a different approach from this proposal [1]
> to use PMEM as NUMA nodes.
>
> The approach is designed to follow the below principles:
>
> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
>
> 2. DRAM first/by default. No surprise to existing applications and default
> running. PMEM will not be allocated unless its node is specified explicitly
> by NUMA policy. Some applications may be not very sensitive to memory latency,
> so they could be placed on PMEM nodes then have hot pages promote to DRAM
> gradually.


I am not against the approach for some workloads. However, many HPC
people would rather do this manually. But there's currently no easy way
to find out from userspace whether a given NUMA node is DDR or PMEM*. We
have to assume HMAT is available (and correct) and look at performance
attributes. When talking to humans, it would be better to say "I
allocated on the local DDR NUMA node" rather than "I allocated on the
fastest node according to HMAT latency".

Also, when we'll have HBM+DDR, some applications may want to use DDR by
default, which means they want the *slowest* node according to HMAT (by
the way, will your hybrid policy work if we ever have HBM+DDR+PMEM?).
Performance attributes could help, but how does user-space know for sure
that X>Y will still mean HBM>DDR and not DDR>PMEM in 5 years?

It seems to me that exporting a flag in sysfs saying whether a node is
PMEM could be convenient. Patch series [1] exported a "type" in sysfs
node directories ("pmem" or "dram"). I don't know how if there's an easy
way to define what HBM is and expose that type too.

Brice

* As far as I know, the only way is to look at all DAX devices until you
find the given NUMA node in the "target_node" attribute. If none, you're
likely not PMEM-backed.
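
For illustration only (a hypothetical userspace snippet, not from this
series), that lookup could be done by scanning the device-dax sysfs
entries:

#include <glob.h>
#include <stdio.h>
#include <stdlib.h>

/* Return 1 if some DAX device lists @nid as its target node */
static int node_backed_by_dax(int nid)
{
	glob_t g;
	size_t i;
	int found = 0;
	char buf[16];

	if (glob("/sys/bus/dax/devices/*/target_node", 0, NULL, &g))
		return 0;

	for (i = 0; i < g.gl_pathc; i++) {
		FILE *f = fopen(g.gl_pathv[i], "r");

		if (f && fgets(buf, sizeof(buf), f) && atoi(buf) == nid)
			found = 1;
		if (f)
			fclose(f);
	}

	globfree(&g);
	return found;
}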


> [1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-25 16:15 ` [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Brice Goglin
@ 2019-03-25 16:56     ` Dan Williams
  2019-03-25 20:04   ` Yang Shi
  1 sibling, 0 replies; 66+ messages in thread
From: Dan Williams @ 2019-03-25 16:56 UTC (permalink / raw)
  To: Brice Goglin
  Cc: Yang Shi, Michal Hocko, Mel Gorman, Rik van Riel,
	Johannes Weiner, Andrew Morton, Dave Hansen, Keith Busch,
	Fengguang Wu, Du, Fan, Huang, Ying, Linux MM,
	Linux Kernel Mailing List

On Mon, Mar 25, 2019 at 9:15 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
>
> Le 23/03/2019 à 05:44, Yang Shi a écrit :
> > With Dave Hansen's patches merged into Linus's tree
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
> >
> > PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> > effectively and efficiently is still a question.
> >
> > There have been a couple of proposals posted on the mailing list [1] [2].
> >
> > The patchset is aimed to try a different approach from this proposal [1]
> > to use PMEM as NUMA nodes.
> >
> > The approach is designed to follow the below principles:
> >
> > 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
> >
> > 2. DRAM first/by default. No surprise to existing applications and default
> > running. PMEM will not be allocated unless its node is specified explicitly
> > by NUMA policy. Some applications may be not very sensitive to memory latency,
> > so they could be placed on PMEM nodes then have hot pages promote to DRAM
> > gradually.
>
>
> I am not against the approach for some workloads. However, many HPC
> people would rather do this manually. But there's currently no easy way
> to find out from userspace whether a given NUMA node is DDR or PMEM*. We
> have to assume HMAT is available (and correct) and look at performance
> attributes. When talking to humans, it would be better to say "I
> allocated on the local DDR NUMA node" rather than "I allocated on the
> fastest node according to HMAT latency".
>
> Also, when we'll have HBM+DDR, some applications may want to use DDR by
> default, which means they want the *slowest* node according to HMAT (by
> the way, will your hybrid policy work if we ever have HBM+DDR+PMEM?).
> Performance attributes could help, but how does user-space know for sure
> that X>Y will still mean HBM>DDR and not DDR>PMEM in 5 years?
>
> It seems to me that exporting a flag in sysfs saying whether a node is
> PMEM could be convenient. Patch series [1] exported a "type" in sysfs
> node directories ("pmem" or "dram"). I don't know how if there's an easy
> way to define what HBM is and expose that type too.

I'm generally against the concept that a "pmem" or "type" flag should
indicate anything about the expected performance of the address range.
The kernel should explicitly look to the HMAT for performance data and
not otherwise make type-based performance assumptions.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-25 16:56     ` Dan Williams
  (?)
@ 2019-03-25 17:45     ` Brice Goglin
  2019-03-25 19:29         ` Dan Williams
  -1 siblings, 1 reply; 66+ messages in thread
From: Brice Goglin @ 2019-03-25 17:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Yang Shi, Michal Hocko, Mel Gorman, Rik van Riel,
	Johannes Weiner, Andrew Morton, Dave Hansen, Keith Busch,
	Fengguang Wu, Du, Fan, Huang, Ying, Linux MM,
	Linux Kernel Mailing List

Le 25/03/2019 à 17:56, Dan Williams a écrit :
>
> I'm generally against the concept that a "pmem" or "type" flag should
> indicate anything about the expected performance of the address range.
> The kernel should explicitly look to the HMAT for performance data and
> not otherwise make type-based performance assumptions.


Oh sorry, I didn't mean to have the kernel use such a flag to decide on
placement, but rather to expose more information to userspace to clarify
what all these nodes are about when userspace decides where to
allocate things.

I understand that current NVDIMM-F are not slower than DDR and HMAT
would better describe this than a flag. But I have seen so many buggy or
dummy SLIT tables in the past that I wonder if we can expect HMAT to be
widely available (and correct).

Is there a safe fallback in case of missing or buggy HMAT? For instance,
is DDR supposed to be listed before NVDIMM (or HBM) in SRAT?

Brice



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/10] mm: control memory placement by nodemask for two tier main memory
  2019-03-23 17:21     ` Dan Williams
  (?)
@ 2019-03-25 19:28     ` Yang Shi
  2019-03-25 23:18         ` Dan Williams
  -1 siblings, 1 reply; 66+ messages in thread
From: Yang Shi @ 2019-03-25 19:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List



On 3/23/19 10:21 AM, Dan Williams wrote:
> On Fri, Mar 22, 2019 at 9:45 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>> When running applications on the machine with NVDIMM as NUMA node, the
>> memory allocation may end up on NVDIMM node.  This may result in silent
>> performance degradation and regression due to the difference of hardware
>> property.
>>
>> DRAM first should be obeyed to prevent from surprising regression.  Any
>> non-DRAM nodes should be excluded from default allocation.  Use nodemask
>> to control the memory placement.  Introduce def_alloc_nodemask which has
>> DRAM nodes set only.  Any non-DRAM allocation should be specified by
>> NUMA policy explicitly.
>>
>> In the future we may be able to extract the memory charasteristics from
>> HMAT or other source to build up the default allocation nodemask.
>> However, just distinguish DRAM and PMEM (non-DRAM) nodes by SRAT flag
>> for the time being.
>>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>>   arch/x86/mm/numa.c     |  1 +
>>   drivers/acpi/numa.c    |  8 ++++++++
>>   include/linux/mmzone.h |  3 +++
>>   mm/page_alloc.c        | 18 ++++++++++++++++--
>>   4 files changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>> index dfb6c4d..d9e0ca4 100644
>> --- a/arch/x86/mm/numa.c
>> +++ b/arch/x86/mm/numa.c
>> @@ -626,6 +626,7 @@ static int __init numa_init(int (*init_func)(void))
>>          nodes_clear(numa_nodes_parsed);
>>          nodes_clear(node_possible_map);
>>          nodes_clear(node_online_map);
>> +       nodes_clear(def_alloc_nodemask);
>>          memset(&numa_meminfo, 0, sizeof(numa_meminfo));
>>          WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
>>                                    MAX_NUMNODES));
>> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
>> index 867f6e3..79dfedf 100644
>> --- a/drivers/acpi/numa.c
>> +++ b/drivers/acpi/numa.c
>> @@ -296,6 +296,14 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
>>                  goto out_err_bad_srat;
>>          }
>>
>> +       /*
>> +        * Non volatile memory is excluded from zonelist by default.
>> +        * Only regular DRAM nodes are set in default allocation node
>> +        * mask.
>> +        */
>> +       if (!(ma->flags & ACPI_SRAT_MEM_NON_VOLATILE))
>> +               node_set(node, def_alloc_nodemask);
> Hmm, no, I don't think we should do this. Especially considering
> current generation NVDIMMs are energy backed DRAM there is no
> performance difference that should be assumed by the non-volatile
> flag.

Actually, here I would like to initialize a node mask for default 
allocation. Memory allocation should not end up on any nodes excluded by 
this node mask unless they are specified by mempolicy.

We may have a few different ways or criteria to initialize the node 
mask. For example, we can read from HMAT (when HMAT is ready in the 
future), and we definitely could have non-DRAM nodes set if they have no 
performance difference (I suppose you mean NVDIMM-F or HBM).

As long as there are different tiers, distinguished by performance, for 
main memory, IMHO, there should be a defined default allocation node 
mask to control the memory placement no matter where we get the information.

But for now we don't have such information ready for this use yet, so 
the SRAT flag might be a reasonable choice.

>
> Why isn't default SLIT distance sufficient for ensuring a DRAM-first
> default policy?

"DRAM-first" may sound ambiguous, actually I mean "DRAM only by 
default". SLIT should just can tell us what node is local what node is 
remote, but can't tell us the performance difference.

Thanks,
Yang



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-25 17:45     ` Brice Goglin
@ 2019-03-25 19:29         ` Dan Williams
  0 siblings, 0 replies; 66+ messages in thread
From: Dan Williams @ 2019-03-25 19:29 UTC (permalink / raw)
  To: Brice Goglin
  Cc: Yang Shi, Michal Hocko, Mel Gorman, Rik van Riel,
	Johannes Weiner, Andrew Morton, Dave Hansen, Keith Busch,
	Fengguang Wu, Du, Fan, Huang, Ying, Linux MM,
	Linux Kernel Mailing List

On Mon, Mar 25, 2019 at 10:45 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
> Le 25/03/2019 à 17:56, Dan Williams a écrit :
> >
> > I'm generally against the concept that a "pmem" or "type" flag should
> > indicate anything about the expected performance of the address range.
> > The kernel should explicitly look to the HMAT for performance data and
> > not otherwise make type-based performance assumptions.
>
>
> Oh sorry, I didn't mean to have the kernel use such a flag to decide of
> placement, but rather to expose more information to userspace to clarify
> what all these nodes are about when userspace will decide where to
> allocate things.

I understand, but I'm concerned about the risk of userspace developing
vendor-specific, or generation-specific policies around a coarse type
identifier. I think the lack of type specificity is a feature rather
than a gap, because it requires userspace to consider deeper
information.

Perhaps "path" might be a suitable replacement identifier rather than
type. I.e. memory that originates from an ACPI.NFIT root device is
likely "pmem".

> I understand that current NVDIMM-F are not slower than DDR and HMAT
> would better describe this than a flag. But I have seen so many buggy or
> dummy SLIT tables in the past that I wonder if we can expect HMAT to be
> widely available (and correct).

There's always a fear that the platform BIOS will try to game OS
behavior. However, that was the reason that HMAT was defined to
indicate actual performance values rather than relative ones. It is
hopefully harder to game than the relative SLIT values, but I'll grant
you it's not impossible.

> Is there a safe fallback in case of missing or buggy HMAT? For instance,
> is DDR supposed to be listed before NVDIMM (or HBM) in SRAT?

One fallback might be to make some of these sysfs attributes writable
so userspace can correct the situation, but I'm otherwise unclear on
what you mean by "safe". If a platform has hard dependencies on
correctly enumerating memory performance capabilities then there's not
much the kernel can do if the HMAT is botched. I would expect the
general case is that the performance capabilities are a soft
dependency, but things still work if the data is wrong.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-24 22:20   ` Keith Busch
@ 2019-03-25 19:49     ` Yang Shi
  2019-03-27  0:35       ` Keith Busch
  0 siblings, 1 reply; 66+ messages in thread
From: Yang Shi @ 2019-03-25 19:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, linux-mm,
	linux-kernel



On 3/24/19 3:20 PM, Keith Busch wrote:
> On Sat, Mar 23, 2019 at 12:44:31PM +0800, Yang Shi wrote:
>>   		/*
>> +		 * Demote DRAM pages regardless the mempolicy.
>> +		 * Demot anonymous pages only for now and skip MADV_FREE
>> +		 * pages.
>> +		 */
>> +		if (PageAnon(page) && !PageSwapCache(page) &&
>> +		    (node_isset(page_to_nid(page), def_alloc_nodemask)) &&
>> +		    PageSwapBacked(page)) {
>> +
>> +			if (has_nonram_online()) {
>> +				list_add(&page->lru, &demote_pages);
>> +				unlock_page(page);
>> +				continue;
>> +			}
>> +		}
>> +
>> +		/*
>>   		 * Anonymous process memory has backing store?
>>   		 * Try to allocate it some swap space here.
>>   		 * Lazyfree page could be freed directly
>> @@ -1477,6 +1507,25 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>>   		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
>>   	}
>>   
>> +	/* Demote pages to PMEM */
>> +	if (!list_empty(&demote_pages)) {
>> +		int err, target_nid;
>> +		nodemask_t used_mask;
>> +
>> +		nodes_clear(used_mask);
>> +		target_nid = find_next_best_node(pgdat->node_id, &used_mask,
>> +						 true);
>> +
>> +		err = migrate_pages(&demote_pages, alloc_new_node_page, NULL,
>> +				    target_nid, MIGRATE_ASYNC, MR_DEMOTE);
>> +
>> +		if (err) {
>> +			putback_movable_pages(&demote_pages);
>> +
>> +			list_splice(&ret_pages, &demote_pages);
>> +		}
>> +	}
>> +
>>   	mem_cgroup_uncharge_list(&free_pages);
>>   	try_to_unmap_flush();
>>   	free_unref_page_list(&free_pages);
> How do these pages eventually get to swap when migration fails? Looks
> like that's skipped.

Yes, they will just be put back on the LRU. Actually, I don't expect 
migration to fail very often at this stage (though I have no test data 
to support this hypothesis) since the pages have been isolated from the 
LRU, so other reclaim paths should not find them anymore.

If a page is locked by someone else right before migration, it is 
likely being referenced again, so putting it back on the LRU sounds 
reasonable.

A potential improvement is to have sync migration for kswapd.
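
For illustration only (a rough sketch, not what this patch does; the
helper name is made up), failed demotions could instead be handed back
to the caller's list so a second pass over page_list can still swap
them out:

static void demote_or_keep(struct list_head *demote_pages,
			   struct list_head *page_list,
			   struct pglist_data *pgdat)
{
	nodemask_t used_mask;
	int target_nid;

	if (list_empty(demote_pages))
		return;

	nodes_clear(used_mask);
	target_nid = find_next_best_node(pgdat->node_id, &used_mask, true);

	/* migrate_pages() leaves the pages it could not move on the list */
	migrate_pages(demote_pages, alloc_new_node_page, NULL, target_nid,
		      MIGRATE_ASYNC, MR_DEMOTE);

	/* Whatever is left failed; let the caller retry it via swap */
	list_splice_init(demote_pages, page_list);
}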

>
> And page cache demotion is useful too, we shouldn't consider only
> anonymous for this feature.

Yes, definitely. I'm looking into the page cache case now. Any 
suggestion is welcome.
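
To make that concrete (purely a sketch, with dirty/writeback handling
deliberately left out), the anon-only check could later be joined by
something like:

		/* Sketch: also queue clean, not-under-writeback file pages */
		if (page_is_file_cache(page) && !PageDirty(page) &&
		    !PageWriteback(page) &&
		    node_isset(page_to_nid(page), def_alloc_nodemask) &&
		    has_nonram_online()) {
			list_add(&page->lru, &demote_pages);
			unlock_page(page);
			continue;
		}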

Thanks,
Yang



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-25 16:15 ` [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Brice Goglin
  2019-03-25 16:56     ` Dan Williams
@ 2019-03-25 20:04   ` Yang Shi
  1 sibling, 0 replies; 66+ messages in thread
From: Yang Shi @ 2019-03-25 20:04 UTC (permalink / raw)
  To: Brice Goglin, mhocko, mgorman, riel, hannes, akpm, dave.hansen,
	keith.busch, dan.j.williams, fengguang.wu, fan.du, ying.huang
  Cc: linux-mm, linux-kernel



On 3/25/19 9:15 AM, Brice Goglin wrote:
> Le 23/03/2019 à 05:44, Yang Shi a écrit :
>> With Dave Hansen's patches merged into Linus's tree
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>>
>> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
>> effectively and efficiently is still a question.
>>
>> There have been a couple of proposals posted on the mailing list [1] [2].
>>
>> The patchset is aimed to try a different approach from this proposal [1]
>> to use PMEM as NUMA nodes.
>>
>> The approach is designed to follow the below principles:
>>
>> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
>>
>> 2. DRAM first/by default. No surprise to existing applications and default
>> running. PMEM will not be allocated unless its node is specified explicitly
>> by NUMA policy. Some applications may be not very sensitive to memory latency,
>> so they could be placed on PMEM nodes then have hot pages promote to DRAM
>> gradually.
>
> I am not against the approach for some workloads. However, many HPC
> people would rather do this manually. But there's currently no easy way
> to find out from userspace whether a given NUMA node is DDR or PMEM*. We
> have to assume HMAT is available (and correct) and look at performance
> attributes. When talking to humans, it would be better to say "I
> allocated on the local DDR NUMA node" rather than "I allocated on the
> fastest node according to HMAT latency".

Yes, I agree some information should be exposed to the kernel or 
userspace to tell which nodes are DRAM nodes and which are not (maybe 
HBM or PMEM). I assume the default allocation should end up on DRAM 
nodes for most workloads. If someone would like to control this 
manually, beyond mempolicy, the default allocation node mask could be 
exported to user space via sysfs so that it can be changed on demand.

>
> Also, when we'll have HBM+DDR, some applications may want to use DDR by
> default, which means they want the *slowest* node according to HMAT (by
> the way, will your hybrid policy work if we ever have HBM+DDR+PMEM?).
> Performance attributes could help, but how does user-space know for sure
> that X>Y will still mean HBM>DDR and not DDR>PMEM in 5 years?

This is what I mentioned above: we need the information exported from 
HMAT or something similar to tell us which nodes are DRAM nodes, since 
DRAM may be the lowest memory tier.

Or we may be able to assume the nodes associated with CPUs are DRAM 
nodes, on the assumption that both HBM and PMEM are CPU-less nodes.
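
For illustration only (a sketch, not part of these patches; the helper
name is made up), that assumption could look like:

static void __init init_def_alloc_nodemask(void)
{
	int cpu, nid;

	nodes_clear(def_alloc_nodemask);

	/* Assume cpu_to_node() has already been set up at this point */
	for_each_possible_cpu(cpu) {
		nid = cpu_to_node(cpu);
		if (nid != NUMA_NO_NODE)
			node_set(nid, def_alloc_nodemask);
	}
}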

Thanks,
Yang

>
> It seems to me that exporting a flag in sysfs saying whether a node is
> PMEM could be convenient. Patch series [1] exported a "type" in sysfs
> node directories ("pmem" or "dram"). I don't know how if there's an easy
> way to define what HBM is and expose that type too.
>
> Brice
>
> * As far as I know, the only way is to look at all DAX devices until you
> find the given NUMA node in the "target_node" attribute. If none, you're
> likely not PMEM-backed.
>
>
>> [1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-23  6:03   ` Zi Yan
@ 2019-03-25 21:49     ` Yang Shi
  0 siblings, 0 replies; 66+ messages in thread
From: Yang Shi @ 2019-03-25 21:49 UTC (permalink / raw)
  To: Zi Yan
  Cc: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, linux-mm,
	linux-kernel



On 3/22/19 11:03 PM, Zi Yan wrote:
> On 22 Mar 2019, at 21:44, Yang Shi wrote:
>
>> Since PMEM provides larger capacity than DRAM and has much lower
>> access latency than disk, so it is a good choice to use as a middle
>> tier between DRAM and disk in page reclaim path.
>>
>> With PMEM nodes, the demotion path of anonymous pages could be:
>>
>> DRAM -> PMEM -> swap device
>>
>> This patch demotes anonymous pages only for the time being and demote
>> THP to PMEM in a whole.  However this may cause expensive page reclaim
>> and/or compaction on PMEM node if there is memory pressure on it.  But,
>> considering the capacity of PMEM and allocation only happens on PMEM
>> when PMEM is specified explicity, such cases should be not that often.
>> So, it sounds worth keeping THP in a whole instead of splitting it.
>>
>> Demote pages to the cloest non-DRAM node even though the system is
>> swapless.  The current logic of page reclaim just scan anon LRU when
>> swap is on and swappiness is set properly.  Demoting to PMEM doesn't
>> need care whether swap is available or not.  But, reclaiming from PMEM
>> still skip anon LRU is swap is not available.
>>
>> The demotion just happens between DRAM node and its cloest PMEM node.
>> Demoting to a remote PMEM node is not allowed for now.
>>
>> And, define a new migration reason for demotion, called MR_DEMOTE.
>> Demote page via async migration to avoid blocking.
>>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>>   include/linux/migrate.h        |  1 +
>>   include/trace/events/migrate.h |  3 +-
>>   mm/debug.c                     |  1 +
>>   mm/internal.h                  | 22 ++++++++++
>>   mm/vmscan.c                    | 99 ++++++++++++++++++++++++++++++++++--------
>>   5 files changed, 107 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index e13d9bf..78c8dda 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -25,6 +25,7 @@ enum migrate_reason {
>>   	MR_MEMPOLICY_MBIND,
>>   	MR_NUMA_MISPLACED,
>>   	MR_CONTIG_RANGE,
>> +	MR_DEMOTE,
>>   	MR_TYPES
>>   };
>>
>> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
>> index 705b33d..c1d5b36 100644
>> --- a/include/trace/events/migrate.h
>> +++ b/include/trace/events/migrate.h
>> @@ -20,7 +20,8 @@
>>   	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
>>   	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
>>   	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
>> -	EMe(MR_CONTIG_RANGE,	"contig_range")
>> +	EM( MR_CONTIG_RANGE,	"contig_range")			\
>> +	EMe(MR_DEMOTE,		"demote")
>>
>>   /*
>>    * First define the enums in the above macros to be exported to userspace
>> diff --git a/mm/debug.c b/mm/debug.c
>> index c0b31b6..cc0d7df 100644
>> --- a/mm/debug.c
>> +++ b/mm/debug.c
>> @@ -25,6 +25,7 @@
>>   	"mempolicy_mbind",
>>   	"numa_misplaced",
>>   	"cma",
>> +	"demote",
>>   };
>>
>>   const struct trace_print_flags pageflag_names[] = {
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 46ad0d8..0152300 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -303,6 +303,19 @@ static inline int find_next_best_node(int node, nodemask_t *used_node_mask,
>>   }
>>   #endif
>>
>> +static inline bool has_nonram_online(void)
>> +{
>> +	int i = 0;
>> +
>> +	for_each_online_node(i) {
>> +		/* Have PMEM node online? */
>> +		if (!node_isset(i, def_alloc_nodemask))
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>>   /* mm/util.c */
>>   void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
>>   		struct vm_area_struct *prev, struct rb_node *rb_parent);
>> @@ -565,5 +578,14 @@ static inline bool is_migrate_highatomic_page(struct page *page)
>>   }
>>
>>   void setup_zone_pageset(struct zone *zone);
>> +
>> +#ifdef CONFIG_NUMA
>>   extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
>> +#else
>> +static inline struct page *alloc_new_node_page(struct page *page,
>> +					       unsigned long node)
>> +{
>> +	return NULL;
>> +}
>> +#endif
>>   #endif	/* __MM_INTERNAL_H */
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index a5ad0b3..bdcab6b 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1094,6 +1094,19 @@ static void page_check_dirty_writeback(struct page *page,
>>   		mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
>>   }
>>
>> +static inline bool is_demote_ok(struct pglist_data *pgdat)
>> +{
>> +	/* Current node is not DRAM node */
>> +	if (!node_isset(pgdat->node_id, def_alloc_nodemask))
>> +		return false;
>> +
>> +	/* No online PMEM node */
>> +	if (!has_nonram_online())
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>>   /*
>>    * shrink_page_list() returns the number of reclaimed pages
>>    */
>> @@ -1106,6 +1119,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>>   {
>>   	LIST_HEAD(ret_pages);
>>   	LIST_HEAD(free_pages);
>> +	LIST_HEAD(demote_pages);
>>   	unsigned nr_reclaimed = 0;
>>
>>   	memset(stat, 0, sizeof(*stat));
>> @@ -1262,6 +1276,22 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>>   		}
>>
>>   		/*
>> +		 * Demote DRAM pages regardless the mempolicy.
>> +		 * Demot anonymous pages only for now and skip MADV_FREE
> s/Demot/Demote

Thanks for catching this. Will fix.

>
>> +		 * pages.
>> +		 */
>> +		if (PageAnon(page) && !PageSwapCache(page) &&
>> +		    (node_isset(page_to_nid(page), def_alloc_nodemask)) &&
>> +		    PageSwapBacked(page)) {
>> +
>> +			if (has_nonram_online()) {
>> +				list_add(&page->lru, &demote_pages);
>> +				unlock_page(page);
>> +				continue;
>> +			}
>> +		}
>> +
>> +		/*
>>   		 * Anonymous process memory has backing store?
>>   		 * Try to allocate it some swap space here.
>>   		 * Lazyfree page could be freed directly
>> @@ -1477,6 +1507,25 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>>   		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
>>   	}
>>
>> +	/* Demote pages to PMEM */
>> +	if (!list_empty(&demote_pages)) {
>> +		int err, target_nid;
>> +		nodemask_t used_mask;
>> +
>> +		nodes_clear(used_mask);
>> +		target_nid = find_next_best_node(pgdat->node_id, &used_mask,
>> +						 true);
>> +
>> +		err = migrate_pages(&demote_pages, alloc_new_node_page, NULL,
>> +				    target_nid, MIGRATE_ASYNC, MR_DEMOTE);
>> +
>> +		if (err) {
>> +			putback_movable_pages(&demote_pages);
>> +
>> +			list_splice(&ret_pages, &demote_pages);
>> +		}
>> +	}
>> +
> I like your approach here. It reuses the existing migrate_pages() interface without
> adding extra code. I also would like to be CC’d in your future versions.

Yes, sure.

Thanks,
Yang

>
> Thank you.
>
> --
> Best Regards,
> Yan Zi


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-25 19:29         ` Dan Williams
  (?)
@ 2019-03-25 23:09         ` Brice Goglin
  2019-03-25 23:37             ` Dan Williams
  -1 siblings, 1 reply; 66+ messages in thread
From: Brice Goglin @ 2019-03-25 23:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Yang Shi, Michal Hocko, Mel Gorman, Rik van Riel,
	Johannes Weiner, Andrew Morton, Dave Hansen, Keith Busch,
	Fengguang Wu, Du, Fan, Huang, Ying, Linux MM,
	Linux Kernel Mailing List


Le 25/03/2019 à 20:29, Dan Williams a écrit :
> Perhaps "path" might be a suitable replacement identifier rather than
> type. I.e. memory that originates from an ACPI.NFIT root device is
> likely "pmem".


Could work.

What kind of "path" would we get for other types of memory? (DDR,
non-ACPI-based based PMEM if any, NVMe PMR?)

Thanks

Brice


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/10] mm: control memory placement by nodemask for two tier main memory
  2019-03-25 19:28     ` Yang Shi
@ 2019-03-25 23:18         ` Dan Williams
  0 siblings, 0 replies; 66+ messages in thread
From: Dan Williams @ 2019-03-25 23:18 UTC (permalink / raw)
  To: Yang Shi
  Cc: Michal Hocko, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List, Vishal L Verma

On Mon, Mar 25, 2019 at 12:28 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>
>
> On 3/23/19 10:21 AM, Dan Williams wrote:
> > On Fri, Mar 22, 2019 at 9:45 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
> >> When running applications on the machine with NVDIMM as NUMA node, the
> >> memory allocation may end up on NVDIMM node.  This may result in silent
> >> performance degradation and regression due to the difference of hardware
> >> property.
> >>
> >> DRAM first should be obeyed to prevent from surprising regression.  Any
> >> non-DRAM nodes should be excluded from default allocation.  Use nodemask
> >> to control the memory placement.  Introduce def_alloc_nodemask which has
> >> DRAM nodes set only.  Any non-DRAM allocation should be specified by
> >> NUMA policy explicitly.
> >>
> >> In the future we may be able to extract the memory charasteristics from
> >> HMAT or other source to build up the default allocation nodemask.
> >> However, just distinguish DRAM and PMEM (non-DRAM) nodes by SRAT flag
> >> for the time being.
> >>
> >> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> >> ---
> >>   arch/x86/mm/numa.c     |  1 +
> >>   drivers/acpi/numa.c    |  8 ++++++++
> >>   include/linux/mmzone.h |  3 +++
> >>   mm/page_alloc.c        | 18 ++++++++++++++++--
> >>   4 files changed, 28 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> >> index dfb6c4d..d9e0ca4 100644
> >> --- a/arch/x86/mm/numa.c
> >> +++ b/arch/x86/mm/numa.c
> >> @@ -626,6 +626,7 @@ static int __init numa_init(int (*init_func)(void))
> >>          nodes_clear(numa_nodes_parsed);
> >>          nodes_clear(node_possible_map);
> >>          nodes_clear(node_online_map);
> >> +       nodes_clear(def_alloc_nodemask);
> >>          memset(&numa_meminfo, 0, sizeof(numa_meminfo));
> >>          WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
> >>                                    MAX_NUMNODES));
> >> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> >> index 867f6e3..79dfedf 100644
> >> --- a/drivers/acpi/numa.c
> >> +++ b/drivers/acpi/numa.c
> >> @@ -296,6 +296,14 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
> >>                  goto out_err_bad_srat;
> >>          }
> >>
> >> +       /*
> >> +        * Non volatile memory is excluded from zonelist by default.
> >> +        * Only regular DRAM nodes are set in default allocation node
> >> +        * mask.
> >> +        */
> >> +       if (!(ma->flags & ACPI_SRAT_MEM_NON_VOLATILE))
> >> +               node_set(node, def_alloc_nodemask);
> > Hmm, no, I don't think we should do this. Especially considering
> > current generation NVDIMMs are energy backed DRAM there is no
> > performance difference that should be assumed by the non-volatile
> > flag.
>
> Actually, here I would like to initialize a node mask for default
> allocation. Memory allocation should not end up on any nodes excluded by
> this node mask unless they are specified by mempolicy.
>
> We may have a few different ways or criteria to initialize the node
> mask, for example, we can read from HMAT (when HMAT is ready in the
> future), and we definitely could have non-DRAM nodes set if they have no
> performance difference (I'm supposed you mean NVDIMM-F  or HBM).
>
> As long as there are different tiers, distinguished by performance, for
> main memory, IMHO, there should be a defined default allocation node
> mask to control the memory placement no matter where we get the information.

I understand the intent, but I don't think the kernel should have such
a hardline policy by default. However, it would be a worthwhile
mechanism and policy to consider for the dax-hotplug userspace
tooling. I.e. arrange for a given device-dax instance to be onlined,
but set the policy to require explicit opt-in by numa binding for it
to be an allocation / migration option.

I added Vishal to the cc who is looking into such policy tooling.
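
As a purely illustrative example (the function name is mine; link with
-lnuma), the explicit opt-in from userspace could be a simple MPOL_BIND
binding to the PMEM node:

#include <numaif.h>
#include <sys/mman.h>

/* Restrict an existing mapping to a single (PMEM) node */
static long bind_to_pmem_node(void *addr, size_t len, int pmem_node)
{
	unsigned long nodemask = 1UL << pmem_node;

	return mbind(addr, len, MPOL_BIND, &nodemask,
		     sizeof(nodemask) * 8, MPOL_MF_MOVE);
}

A tool could look up the PMEM node id (e.g. from the device-dax
"target_node" attribute) and apply such a binding before the memory is
faulted in.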

> But, for now we haven't had such information ready for such use yet, so
> the SRAT flag might be a choice.
>
> >
> > Why isn't default SLIT distance sufficient for ensuring a DRAM-first
> > default policy?
>
> "DRAM-first" may sound ambiguous, actually I mean "DRAM only by
> default". SLIT should just can tell us what node is local what node is
> remote, but can't tell us the performance difference.

I think it's a useful semantic, but let's leave the selection of that
policy to an explicit userspace decision.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/10] mm: control memory placement by nodemask for two tier main memory
  2019-03-25 23:18         ` Dan Williams
  (?)
@ 2019-03-25 23:36         ` Yang Shi
  2019-03-25 23:42             ` Dan Williams
  -1 siblings, 1 reply; 66+ messages in thread
From: Yang Shi @ 2019-03-25 23:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List, Vishal L Verma



On 3/25/19 4:18 PM, Dan Williams wrote:
> On Mon, Mar 25, 2019 at 12:28 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>
>>
>> On 3/23/19 10:21 AM, Dan Williams wrote:
>>> On Fri, Mar 22, 2019 at 9:45 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
>>>> When running applications on the machine with NVDIMM as NUMA node, the
>>>> memory allocation may end up on NVDIMM node.  This may result in silent
>>>> performance degradation and regression due to the difference of hardware
>>>> property.
>>>>
>>>> DRAM first should be obeyed to prevent from surprising regression.  Any
>>>> non-DRAM nodes should be excluded from default allocation.  Use nodemask
>>>> to control the memory placement.  Introduce def_alloc_nodemask which has
>>>> DRAM nodes set only.  Any non-DRAM allocation should be specified by
>>>> NUMA policy explicitly.
>>>>
>>>> In the future we may be able to extract the memory charasteristics from
>>>> HMAT or other source to build up the default allocation nodemask.
>>>> However, just distinguish DRAM and PMEM (non-DRAM) nodes by SRAT flag
>>>> for the time being.
>>>>
>>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>>> ---
>>>>    arch/x86/mm/numa.c     |  1 +
>>>>    drivers/acpi/numa.c    |  8 ++++++++
>>>>    include/linux/mmzone.h |  3 +++
>>>>    mm/page_alloc.c        | 18 ++++++++++++++++--
>>>>    4 files changed, 28 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>>>> index dfb6c4d..d9e0ca4 100644
>>>> --- a/arch/x86/mm/numa.c
>>>> +++ b/arch/x86/mm/numa.c
>>>> @@ -626,6 +626,7 @@ static int __init numa_init(int (*init_func)(void))
>>>>           nodes_clear(numa_nodes_parsed);
>>>>           nodes_clear(node_possible_map);
>>>>           nodes_clear(node_online_map);
>>>> +       nodes_clear(def_alloc_nodemask);
>>>>           memset(&numa_meminfo, 0, sizeof(numa_meminfo));
>>>>           WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
>>>>                                     MAX_NUMNODES));
>>>> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
>>>> index 867f6e3..79dfedf 100644
>>>> --- a/drivers/acpi/numa.c
>>>> +++ b/drivers/acpi/numa.c
>>>> @@ -296,6 +296,14 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
>>>>                   goto out_err_bad_srat;
>>>>           }
>>>>
>>>> +       /*
>>>> +        * Non volatile memory is excluded from zonelist by default.
>>>> +        * Only regular DRAM nodes are set in default allocation node
>>>> +        * mask.
>>>> +        */
>>>> +       if (!(ma->flags & ACPI_SRAT_MEM_NON_VOLATILE))
>>>> +               node_set(node, def_alloc_nodemask);
>>> Hmm, no, I don't think we should do this. Especially considering
>>> current generation NVDIMMs are energy backed DRAM there is no
>>> performance difference that should be assumed by the non-volatile
>>> flag.
>> Actually, here I would like to initialize a node mask for default
>> allocation. Memory allocation should not end up on any nodes excluded by
>> this node mask unless they are specified by mempolicy.
>>
>> We may have a few different ways or criteria to initialize the node
>> mask, for example, we can read from HMAT (when HMAT is ready in the
>> future), and we definitely could have non-DRAM nodes set if they have no
>> performance difference (I'm supposed you mean NVDIMM-F  or HBM).
>>
>> As long as there are different tiers, distinguished by performance, for
>> main memory, IMHO, there should be a defined default allocation node
>> mask to control the memory placement no matter where we get the information.
> I understand the intent, but I don't think the kernel should have such
> a hardline policy by default. However, it would be worthwhile
> mechanism and policy to consider for the dax-hotplug userspace
> tooling. I.e. arrange for a given device-dax instance to be onlined,
> but set the policy to require explicit opt-in by numa binding for it
> to be an allocation / migration option.
>
> I added Vishal to the cc who is looking into such policy tooling.

We may assume the nodes returned by cpu_to_node() would be treated as 
the default allocation nodes from the kernel's point of view.

So, the below code may do the job:

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index d9e0ca4..a3e07da 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -764,6 +764,8 @@ void __init init_cpu_to_node(void)
                         init_memory_less_node(node);

                 numa_set_node(cpu, node);
+
+              node_set(node, def_alloc_nodemask);
         }
  }

Actually, the kernel should not care too much what kind of memory is 
used; any node could be used for memory allocation. But it may be 
better to restrict allocation to some default nodes due to the 
performance disparity, for example, defaulting to regular DRAM only. 
Here the kernel assumes the nodes associated with CPUs are DRAM nodes.

The node mask could be exported to user space to be overridden via a 
userspace tool, sysfs, or the kernel command line. But I still think 
the kernel does need a default node mask.
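
To make the intent concrete, the allocator-side effect boils down to 
something like the sketch below (pick_alloc_nodemask() is a made-up 
name for illustration, not the code in patch #1):

/*
 * Minimal sketch: default allocations are confined to
 * def_alloc_nodemask unless the caller passed an explicit nodemask,
 * e.g. via an MPOL_BIND mempolicy that names a PMEM node.
 */
static inline nodemask_t *pick_alloc_nodemask(nodemask_t *mpol_mask)
{
	if (mpol_mask)			/* explicit mempolicy wins */
		return mpol_mask;

	return &def_alloc_nodemask;	/* otherwise DRAM-only default */
}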

>
>> But, for now we haven't had such information ready for such use yet, so
>> the SRAT flag might be a choice.
>>
>>> Why isn't default SLIT distance sufficient for ensuring a DRAM-first
>>> default policy?
>> "DRAM-first" may sound ambiguous, actually I mean "DRAM only by
>> default". SLIT should just can tell us what node is local what node is
>> remote, but can't tell us the performance difference.
> I think it's a useful semantic, but let's leave the selection of that
> policy to an explicit userspace decision.

Yes, mempolicy is a kind of userspace decision too.

Thanks,
Yang



^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-25 23:09         ` Brice Goglin
@ 2019-03-25 23:37             ` Dan Williams
  0 siblings, 0 replies; 66+ messages in thread
From: Dan Williams @ 2019-03-25 23:37 UTC (permalink / raw)
  To: Brice Goglin
  Cc: Yang Shi, Michal Hocko, Mel Gorman, Rik van Riel,
	Johannes Weiner, Andrew Morton, Dave Hansen, Keith Busch,
	Fengguang Wu, Du, Fan, Huang, Ying, Linux MM,
	Linux Kernel Mailing List

On Mon, Mar 25, 2019 at 4:09 PM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
>
> Le 25/03/2019 à 20:29, Dan Williams a écrit :
> > Perhaps "path" might be a suitable replacement identifier rather than
> > type. I.e. memory that originates from an ACPI.NFIT root device is
> > likely "pmem".
>
>
> Could work.
>
> What kind of "path" would we get for other types of memory? (DDR,
> non-ACPI-based based PMEM if any, NVMe PMR?)

I think for memory that is described by the HMAT "Reservation hint",
and no other ACPI table, it would need to have "HMAT" in the path. For
anything not ACPI it gets easier because the path can be the parent
PCI device.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 01/10] mm: control memory placement by nodemask for two tier main memory
  2019-03-25 23:36         ` Yang Shi
@ 2019-03-25 23:42             ` Dan Williams
  0 siblings, 0 replies; 66+ messages in thread
From: Dan Williams @ 2019-03-25 23:42 UTC (permalink / raw)
  To: Yang Shi
  Cc: Michal Hocko, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List, Vishal L Verma

On Mon, Mar 25, 2019 at 4:36 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
[..]
> >>> Hmm, no, I don't think we should do this. Especially considering
> >>> current generation NVDIMMs are energy backed DRAM there is no
> >>> performance difference that should be assumed by the non-volatile
> >>> flag.
> >> Actually, here I would like to initialize a node mask for default
> >> allocation. Memory allocation should not end up on any nodes excluded by
> >> this node mask unless they are specified by mempolicy.
> >>
> >> We may have a few different ways or criteria to initialize the node
> >> mask, for example, we can read from HMAT (when HMAT is ready in the
> >> future), and we definitely could have non-DRAM nodes set if they have no
> >> performance difference (I'm supposed you mean NVDIMM-F  or HBM).
> >>
> >> As long as there are different tiers, distinguished by performance, for
> >> main memory, IMHO, there should be a defined default allocation node
> >> mask to control the memory placement no matter where we get the information.
> > I understand the intent, but I don't think the kernel should have such
> > a hardline policy by default. However, it would be worthwhile
> > mechanism and policy to consider for the dax-hotplug userspace
> > tooling. I.e. arrange for a given device-dax instance to be onlined,
> > but set the policy to require explicit opt-in by numa binding for it
> > to be an allocation / migration option.
> >
> > I added Vishal to the cc who is looking into such policy tooling.
>
> We may assume the nodes returned by cpu_to_node() would be treated as
> the default allocation nodes from the kernel point of view.
>
> So, the below code may do the job:
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index d9e0ca4..a3e07da 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -764,6 +764,8 @@ void __init init_cpu_to_node(void)
>                          init_memory_less_node(node);
>
>                  numa_set_node(cpu, node);
> +
> +              node_set(node, def_alloc_nodemask);
>          }
>   }
>
> Actually, the kernel should not care too much what kind of memory is
> used, any node could be used for memory allocation. But it may be better
> to restrict to some default nodes due to the performance disparity, for
> example, default to regular DRAM only. Here kernel assumes the nodes
> associated with CPUs would be DRAM nodes.
>
> The node mask could be exported to user space to be override by
> userspace tool or sysfs or kernel commandline.

Yes, sounds good.

> But I still think kernel does need a default node mask.

Yes, it just depends on what is less surprising for userspace to contend
with by default. I would expect an unaware userspace to be confused by
the fact that the system has free memory, but it's unusable. So, usable
by default sounds like the safer option, and special cases to forbid
default usage of given nodes are an administrator / application opt-in
mechanism.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-25 23:37             ` Dan Williams
  (?)
@ 2019-03-26 12:19             ` Jonathan Cameron
  -1 siblings, 0 replies; 66+ messages in thread
From: Jonathan Cameron @ 2019-03-26 12:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Brice Goglin, Yang Shi, Michal Hocko, Mel Gorman, Rik van Riel,
	Johannes Weiner, Andrew Morton, Dave Hansen, Keith Busch,
	Fengguang Wu, Du, Fan, Huang, Ying, Linux MM,
	Linux Kernel Mailing List

On Mon, 25 Mar 2019 16:37:07 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> On Mon, Mar 25, 2019 at 4:09 PM Brice Goglin <Brice.Goglin@inria.fr> wrote:
> >
> >
> > Le 25/03/2019 à 20:29, Dan Williams a écrit :  
> > > Perhaps "path" might be a suitable replacement identifier rather than
> > > type. I.e. memory that originates from an ACPI.NFIT root device is
> > > likely "pmem".  
> >
> >
> > Could work.
> >
> > What kind of "path" would we get for other types of memory? (DDR,
> > non-ACPI-based based PMEM if any, NVMe PMR?)  
> 
> I think for memory that is described by the HMAT "Reservation hint",
> and no other ACPI table, it would need to have "HMAT" in the path. For
> anything not ACPI it gets easier because the path can be the parent
> PCI device.
> 

There is no HMAT reservation hint in ACPI 6.3, but there are other ways
of doing much the same thing, so this is just a nitpick.

J


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-23  4:44 [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Yang Shi
                   ` (10 preceding siblings ...)
  2019-03-25 16:15 ` [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Brice Goglin
@ 2019-03-26 13:58 ` Michal Hocko
  2019-03-26 18:33   ` Yang Shi
  11 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2019-03-26 13:58 UTC (permalink / raw)
  To: Yang Shi
  Cc: mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, linux-mm,
	linux-kernel

On Sat 23-03-19 12:44:25, Yang Shi wrote:
> 
> With Dave Hansen's patches merged into Linus's tree
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
> 
> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> effectively and efficiently is still a question. 
> 
> There have been a couple of proposals posted on the mailing list [1] [2].
> 
> The patchset is aimed to try a different approach from this proposal [1]
> to use PMEM as NUMA nodes.
> 
> The approach is designed to follow the below principles:
> 
> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
> 
> 2. DRAM first/by default. No surprise to existing applications and default
> running. PMEM will not be allocated unless its node is specified explicitly
> by NUMA policy. Some applications may be not very sensitive to memory latency,
> so they could be placed on PMEM nodes then have hot pages promote to DRAM
> gradually.

Why are you pushing yourself into the corner right at the beginning? If
the PMEM is exported as a regular NUMA node then the only difference
should be performance characteristics (modulo durability, which shouldn't
play any role in this particular case, right?). Applications which are
sensitive to memory access should already be using proper binding.
Some NUMA topologies might have quite large interconnect penalties
already. So this doesn't sound like an argument to me, TBH.

> 5. Control memory allocation and hot/cold pages promotion/demotion on per VMA
> basis.

What does that mean? Anon vs. file backed memory?

[...]

> 2. Introduce a new mempolicy, called MPOL_HYBRID to keep other mempolicy
> semantics intact. We would like to have memory placement control on per process
> or even per VMA granularity. So, mempolicy sounds more reasonable than madvise.
> The new mempolicy is mainly used for launching processes on PMEM nodes then
> migrate hot pages to DRAM nodes via NUMA balancing. MPOL_BIND could bind to
> PMEM nodes too, but migrating to DRAM nodes would just break the semantic of
> it. MPOL_PREFERRED can't constraint the allocation to PMEM nodes. So, it sounds
> a new mempolicy is needed to fulfill the usecase.

The above restriction pushes you to invent an API which is not really
trivial to get right, and it already seems quite artificial to me.

> 3. The new mempolicy would promote pages to DRAM via NUMA balancing. IMHO, I
> don't think kernel is a good place to implement sophisticated hot/cold page
> distinguish algorithm due to the complexity and overhead. But, kernel should
> have such capability. NUMA balancing sounds like a good start point.

This is what the kernel does all the time. We call it memory reclaim.

> 4. Promote twice faulted page. Use PG_promote to track if a page is faulted
> twice. This is an optimization to NUMA balancing to reduce the migration
> thrashing and overhead for migrating from PMEM.

I am sorry, but page flags are an extremely scarce resource and a new
flag is extremely hard to get. On the other hand we already do have
use-twice detection for mapped page cache (see page_check_references). I
believe we can generalize that to anon pages as well.
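
For anon pages, the use-twice idea could look roughly like the sketch
below. This is illustrative only, loosely modelled on the second-chance
logic in page_check_references(); should_promote() is a made-up name,
not existing vmscan code:

/*
 * Sketch: promote only pages that have been referenced on two
 * separate scans, using the PG_referenced bit as the "seen once"
 * marker instead of a brand new page flag.
 */
static bool should_promote(struct page *page)
{
	unsigned long vm_flags;
	int referenced = page_referenced(page, 0, NULL, &vm_flags);

	if (referenced && TestClearPageReferenced(page))
		return true;		/* second reference: promote */

	if (referenced)
		SetPageReferenced(page);	/* remember first reference */

	return false;
}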

> 5. When DRAM has memory pressure, demote page to PMEM via page reclaim path.
> This is quite similar to other proposals. Then NUMA balancing will promote
> page to DRAM as long as the page is referenced again. But, the
> promotion/demotion still assumes two tier main memory. And, the demotion may
> break mempolicy.

Yes, this sounds like a good idea to me ;)

> 6. Anonymous page only for the time being since NUMA balancing can't promote
> unmapped page cache.

As long as nvdimm access is faster than regular storage, using any
node (including a pmem one) should be OK.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-26 13:58 ` Michal Hocko
@ 2019-03-26 18:33   ` Yang Shi
  2019-03-26 18:37     ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Yang Shi @ 2019-03-26 18:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, linux-mm,
	linux-kernel



On 3/26/19 6:58 AM, Michal Hocko wrote:
> On Sat 23-03-19 12:44:25, Yang Shi wrote:
>> With Dave Hansen's patches merged into Linus's tree
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>>
>> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
>> effectively and efficiently is still a question.
>>
>> There have been a couple of proposals posted on the mailing list [1] [2].
>>
>> The patchset is aimed to try a different approach from this proposal [1]
>> to use PMEM as NUMA nodes.
>>
>> The approach is designed to follow the below principles:
>>
>> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
>>
>> 2. DRAM first/by default. No surprise to existing applications and default
>> running. PMEM will not be allocated unless its node is specified explicitly
>> by NUMA policy. Some applications may be not very sensitive to memory latency,
>> so they could be placed on PMEM nodes then have hot pages promote to DRAM
>> gradually.
> Why are you pushing yourself into the corner right at the beginning? If
> the PMEM is exported as a regular NUMA node then the only difference
> should be performance characteristics (module durability which shouldn't
> play any role in this particular case, right?). Applications which are
> already sensitive to memory access should better use proper binding already.
> Some NUMA topologies might have quite a large interconnect penalties
> already. So this doesn't sound like an argument to me, TBH.

The major rationale behind this is that we assume most applications 
are sensitive to memory access, particularly for meeting the SLA. The 
applications run on the machine may be unknown to us; they may be 
sensitive or non-sensitive. But assuming they are sensitive to memory 
access sounds safer from an SLA point of view. Then the "cold" pages 
could be demoted to PMEM nodes by the kernel's memory reclaim or other 
tools without impairing the SLA.

If the applications are not sensitive to memory access, they could be 
bound to PMEM or allowed to use PMEM explicitly (with allocation on 
DRAM as nice to have), then the "hot" pages could be promoted to DRAM.

>
>> 5. Control memory allocation and hot/cold pages promotion/demotion on per VMA
>> basis.
> What does that mean? Anon vs. file backed memory?

Yes, kind of. Basically, we would like to control the memory placement 
and promotion (by NUMA balancing) on a per-VMA basis. For example, anon 
VMAs may be DRAM by default while file backed VMAs may be PMEM by 
default. Basically this comes for free with mempolicy.

>
> [...]
>
>> 2. Introduce a new mempolicy, called MPOL_HYBRID to keep other mempolicy
>> semantics intact. We would like to have memory placement control on per process
>> or even per VMA granularity. So, mempolicy sounds more reasonable than madvise.
>> The new mempolicy is mainly used for launching processes on PMEM nodes then
>> migrate hot pages to DRAM nodes via NUMA balancing. MPOL_BIND could bind to
>> PMEM nodes too, but migrating to DRAM nodes would just break the semantic of
>> it. MPOL_PREFERRED can't constraint the allocation to PMEM nodes. So, it sounds
>> a new mempolicy is needed to fulfill the usecase.
> The above restriction pushes you to invent an API which is not really
> trivial to get right and it seems quite artificial to me already.

First of all, the use case is that some applications may not be that 
sensitive to memory access, or are willing to achieve a net win by 
trading some performance to save some cost (keeping some memory on 
PMEM). So, such applications may be bound to PMEM in the first place 
and then have their hot pages promoted to DRAM via NUMA balancing or 
whatever mechanism.

Neither MPOL_BIND nor MPOL_PREFERRED fits this use case very naturally.

Secondly, it looks like only the default policy participates in NUMA 
balancing. Once the policy is changed to MPOL_BIND, NUMA balancing 
would not chime in.

So, I invented the new mempolicy.
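
As a rough illustration of the intended usage from userspace: the 
sketch below assumes the MPOL_HYBRID policy proposed by this RFC (the 
numeric placeholder and the node number are made up, and glibc/libnuma 
would not know about the new policy yet):

#include <numaif.h>
#include <stdio.h>

#ifndef MPOL_HYBRID
#define MPOL_HYBRID 6		/* placeholder value for the RFC policy */
#endif

int main(void)
{
	/* Assume node 2 is a PMEM node on this machine. */
	unsigned long pmem_nodes = 1UL << 2;

	/*
	 * Start the process on PMEM; NUMA balancing may later promote
	 * hot pages to DRAM, unlike MPOL_BIND.
	 */
	if (set_mempolicy(MPOL_HYBRID, &pmem_nodes,
			  sizeof(pmem_nodes) * 8))
		perror("set_mempolicy");

	return 0;
}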

>
>> 3. The new mempolicy would promote pages to DRAM via NUMA balancing. IMHO, I
>> don't think kernel is a good place to implement sophisticated hot/cold page
>> distinguish algorithm due to the complexity and overhead. But, kernel should
>> have such capability. NUMA balancing sounds like a good start point.
> This is what the kernel does all the time. We call it memory reclaim.
>
>> 4. Promote twice faulted page. Use PG_promote to track if a page is faulted
>> twice. This is an optimization to NUMA balancing to reduce the migration
>> thrashing and overhead for migrating from PMEM.
> I am sorry, but page flags are an extremely scarce resource and a new
> flag is extremely hard to get. On the other hand we already do have
> use-twice detection for mapped page cache (see page_check_references). I
> believe we can generalize that to anon pages as well.

Yes, I agree. A new page flag is not the preferred way. I'm going to 
take a look at page_check_references().

>
>> 5. When DRAM has memory pressure, demote page to PMEM via page reclaim path.
>> This is quite similar to other proposals. Then NUMA balancing will promote
>> page to DRAM as long as the page is referenced again. But, the
>> promotion/demotion still assumes two tier main memory. And, the demotion may
>> break mempolicy.
> Yes, this sounds like a good idea to me ;)
>
>> 6. Anonymous page only for the time being since NUMA balancing can't promote
>> unmapped page cache.
> As long as the nvdimm access is faster than the regular storage then
> using any node (including pmem one) should be OK.

However, it still sounds better to have some frequently accessed page 
cache on DRAM.

Thanks,
Yang



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-26 18:33   ` Yang Shi
@ 2019-03-26 18:37     ` Michal Hocko
  2019-03-27  2:58       ` Yang Shi
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2019-03-26 18:37 UTC (permalink / raw)
  To: Yang Shi
  Cc: mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, linux-mm,
	linux-kernel

On Tue 26-03-19 11:33:17, Yang Shi wrote:
> 
> 
> On 3/26/19 6:58 AM, Michal Hocko wrote:
> > On Sat 23-03-19 12:44:25, Yang Shi wrote:
> > > With Dave Hansen's patches merged into Linus's tree
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
> > > 
> > > PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> > > effectively and efficiently is still a question.
> > > 
> > > There have been a couple of proposals posted on the mailing list [1] [2].
> > > 
> > > The patchset is aimed to try a different approach from this proposal [1]
> > > to use PMEM as NUMA nodes.
> > > 
> > > The approach is designed to follow the below principles:
> > > 
> > > 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
> > > 
> > > 2. DRAM first/by default. No surprise to existing applications and default
> > > running. PMEM will not be allocated unless its node is specified explicitly
> > > by NUMA policy. Some applications may be not very sensitive to memory latency,
> > > so they could be placed on PMEM nodes then have hot pages promote to DRAM
> > > gradually.
> > Why are you pushing yourself into the corner right at the beginning? If
> > the PMEM is exported as a regular NUMA node then the only difference
> > should be performance characteristics (module durability which shouldn't
> > play any role in this particular case, right?). Applications which are
> > already sensitive to memory access should better use proper binding already.
> > Some NUMA topologies might have quite a large interconnect penalties
> > already. So this doesn't sound like an argument to me, TBH.
> 
> The major rationale behind this is we assume the most applications should be
> sensitive to memory access, particularly for meeting the SLA. The
> applications run on the machine may be agnostic to us, they may be sensitive
> or non-sensitive. But, assuming they are sensitive to memory access sounds
> safer from SLA point of view. Then the "cold" pages could be demoted to PMEM
> nodes by kernel's memory reclaim or other tools without impairing the SLA.
> 
> If the applications are not sensitive to memory access, they could be bound
> to PMEM or allowed to use PMEM (nice to have allocation on DRAM) explicitly,
> then the "hot" pages could be promoted to DRAM.

Again, how is this different from NUMA in general?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-25 19:49     ` Yang Shi
@ 2019-03-27  0:35       ` Keith Busch
  2019-03-27  3:41         ` Yang Shi
  0 siblings, 1 reply; 66+ messages in thread
From: Keith Busch @ 2019-03-27  0:35 UTC (permalink / raw)
  To: Yang Shi
  Cc: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, linux-mm,
	linux-kernel

On Mon, Mar 25, 2019 at 12:49:21PM -0700, Yang Shi wrote:
> On 3/24/19 3:20 PM, Keith Busch wrote:
> > How do these pages eventually get to swap when migration fails? Looks
> > like that's skipped.
> 
> Yes, they will be just put back to LRU. Actually, I don't expect it would be
> very often to have migration fail at this stage (but I have no test data to
> support this hypothesis) since the pages have been isolated from LRU, so
> other reclaim path should not find them anymore.
> 
> If it is locked by someone else right before migration, it is likely
> referenced again, so putting back to LRU sounds not bad.
> 
> A potential improvement is to have sync migration for kswapd.

Well, it's not that migration fails only if the page is recently
referenced. Migration would fail if there isn't available memory in
the migration node, so this implementation carries an expectation that
migration nodes have higher free capacity than source nodes. And since
you're attempting THPs without ever splitting them, that also requires
lower fragmentation for a successful migration.

Applications, however, may allocate and pin pages directly out of that
migration node to the point it does not have so much free capacity or
physical contiguity, so we probably shouldn't assume it's the only way
to reclaim pages.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-26 18:37     ` Michal Hocko
@ 2019-03-27  2:58       ` Yang Shi
  2019-03-27  9:01         ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Yang Shi @ 2019-03-27  2:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, linux-mm,
	linux-kernel



On 3/26/19 11:37 AM, Michal Hocko wrote:
> On Tue 26-03-19 11:33:17, Yang Shi wrote:
>>
>> On 3/26/19 6:58 AM, Michal Hocko wrote:
>>> On Sat 23-03-19 12:44:25, Yang Shi wrote:
>>>> With Dave Hansen's patches merged into Linus's tree
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>>>>
>>>> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
>>>> effectively and efficiently is still a question.
>>>>
>>>> There have been a couple of proposals posted on the mailing list [1] [2].
>>>>
>>>> The patchset is aimed to try a different approach from this proposal [1]
>>>> to use PMEM as NUMA nodes.
>>>>
>>>> The approach is designed to follow the below principles:
>>>>
>>>> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
>>>>
>>>> 2. DRAM first/by default. No surprise to existing applications and default
>>>> running. PMEM will not be allocated unless its node is specified explicitly
>>>> by NUMA policy. Some applications may be not very sensitive to memory latency,
>>>> so they could be placed on PMEM nodes then have hot pages promote to DRAM
>>>> gradually.
>>> Why are you pushing yourself into the corner right at the beginning? If
>>> the PMEM is exported as a regular NUMA node then the only difference
>>> should be performance characteristics (module durability which shouldn't
>>> play any role in this particular case, right?). Applications which are
>>> already sensitive to memory access should better use proper binding already.
>>> Some NUMA topologies might have quite a large interconnect penalties
>>> already. So this doesn't sound like an argument to me, TBH.
>> The major rationale behind this is we assume the most applications should be
>> sensitive to memory access, particularly for meeting the SLA. The
>> applications run on the machine may be agnostic to us, they may be sensitive
>> or non-sensitive. But, assuming they are sensitive to memory access sounds
>> safer from SLA point of view. Then the "cold" pages could be demoted to PMEM
>> nodes by kernel's memory reclaim or other tools without impairing the SLA.
>>
>> If the applications are not sensitive to memory access, they could be bound
>> to PMEM or allowed to use PMEM (nice to have allocation on DRAM) explicitly,
>> then the "hot" pages could be promoted to DRAM.
> Again, how is this different from NUMA in general?

It is still NUMA; users can still see all the NUMA nodes.

Patch #1 introduces a default allocation node mask to control the 
memory placement. Typically, the node mask just includes DRAM nodes; 
PMEM nodes are excluded from it for default memory allocation.

The node mask could be overridden by the user, per the discussion with 
Dan.

Thanks,
Yang



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-27  0:35       ` Keith Busch
@ 2019-03-27  3:41         ` Yang Shi
  2019-03-27 13:08           ` Keith Busch
  0 siblings, 1 reply; 66+ messages in thread
From: Yang Shi @ 2019-03-27  3:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: mhocko, mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, linux-mm,
	linux-kernel



On 3/26/19 5:35 PM, Keith Busch wrote:
> On Mon, Mar 25, 2019 at 12:49:21PM -0700, Yang Shi wrote:
>> On 3/24/19 3:20 PM, Keith Busch wrote:
>>> How do these pages eventually get to swap when migration fails? Looks
>>> like that's skipped.
>> Yes, they will be just put back to LRU. Actually, I don't expect it would be
>> very often to have migration fail at this stage (but I have no test data to
>> support this hypothesis) since the pages have been isolated from LRU, so
>> other reclaim path should not find them anymore.
>>
>> If it is locked by someone else right before migration, it is likely
>> referenced again, so putting back to LRU sounds not bad.
>>
>> A potential improvement is to have sync migration for kswapd.
> Well, it's not that migration fails only if the page is recently
> referenced. Migration would fail if there isn't available memory in
> the migration node, so this implementation carries an expectation that
> migration nodes have higher free capacity than source nodes. And since
> your attempting THP's without ever splitting them, that also requires
> lower fragmentation for a successful migration.

Yes, it is possible. However, migrate_pages() already has logic to 
handle such a case. If the target node does not have enough space to 
migrate a THP as a whole, it will split the THP and then retry with 
base pages.

THP swapping has been optimized to swap out a whole THP too. It tries 
to add the THP into the swap cache as a whole, splits the THP if the 
attempt fails, then adds the base pages into the swap cache.

So, I think we can leave this to migrate_pages() instead of always 
splitting in advance.
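
Roughly, the behavior I'm relying on looks like the sketch below 
(heavily simplified; try_migrate_whole() and migrate_base_pages() are 
made-up helper names, the real retry loop lives in mm/migrate.c):

/*
 * Sketch of the split-and-retry fallback for a THP that cannot be
 * migrated as a whole to the target node.
 */
static int migrate_one(struct page *page, int target_nid)
{
	int rc = try_migrate_whole(page, target_nid);

	if (rc == -ENOMEM && PageTransHuge(page)) {
		/* No room for a whole THP: split and retry base pages. */
		if (!split_huge_page(page))
			rc = migrate_base_pages(page, target_nid);
	}
	return rc;
}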

Thanks,
Yang

>
> Applications, however, may allocate and pin pages directly out of that
> migration node to the point it does not have so much free capacity or
> physical continuity, so we probably shouldn't assume it's the only way
> to reclaim pages.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-27  2:58       ` Yang Shi
@ 2019-03-27  9:01         ` Michal Hocko
  2019-03-27 17:34             ` Dan Williams
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2019-03-27  9:01 UTC (permalink / raw)
  To: Yang Shi
  Cc: mgorman, riel, hannes, akpm, dave.hansen, keith.busch,
	dan.j.williams, fengguang.wu, fan.du, ying.huang, linux-mm,
	linux-kernel

On Tue 26-03-19 19:58:56, Yang Shi wrote:
> 
> 
> On 3/26/19 11:37 AM, Michal Hocko wrote:
> > On Tue 26-03-19 11:33:17, Yang Shi wrote:
> > > 
> > > On 3/26/19 6:58 AM, Michal Hocko wrote:
> > > > On Sat 23-03-19 12:44:25, Yang Shi wrote:
> > > > > With Dave Hansen's patches merged into Linus's tree
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
> > > > > 
> > > > > PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> > > > > effectively and efficiently is still a question.
> > > > > 
> > > > > There have been a couple of proposals posted on the mailing list [1] [2].
> > > > > 
> > > > > The patchset is aimed to try a different approach from this proposal [1]
> > > > > to use PMEM as NUMA nodes.
> > > > > 
> > > > > The approach is designed to follow the below principles:
> > > > > 
> > > > > 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
> > > > > 
> > > > > 2. DRAM first/by default. No surprise to existing applications and default
> > > > > running. PMEM will not be allocated unless its node is specified explicitly
> > > > > by NUMA policy. Some applications may be not very sensitive to memory latency,
> > > > > so they could be placed on PMEM nodes then have hot pages promote to DRAM
> > > > > gradually.
> > > > Why are you pushing yourself into the corner right at the beginning? If
> > > > the PMEM is exported as a regular NUMA node then the only difference
> > > > should be performance characteristics (module durability which shouldn't
> > > > play any role in this particular case, right?). Applications which are
> > > > already sensitive to memory access should better use proper binding already.
> > > > Some NUMA topologies might have quite a large interconnect penalties
> > > > already. So this doesn't sound like an argument to me, TBH.
> > > The major rationale behind this is we assume the most applications should be
> > > sensitive to memory access, particularly for meeting the SLA. The
> > > applications run on the machine may be agnostic to us, they may be sensitive
> > > or non-sensitive. But, assuming they are sensitive to memory access sounds
> > > safer from SLA point of view. Then the "cold" pages could be demoted to PMEM
> > > nodes by kernel's memory reclaim or other tools without impairing the SLA.
> > > 
> > > If the applications are not sensitive to memory access, they could be bound
> > > to PMEM or allowed to use PMEM (nice to have allocation on DRAM) explicitly,
> > > then the "hot" pages could be promoted to DRAM.
> > Again, how is this different from NUMA in general?
> 
> It is still NUMA, users still can see all the NUMA nodes.

No, the Linux NUMA implementation makes all NUMA nodes available by default
and provides an API to opt in to more fine tuning. What you are
suggesting goes against that semantic and I am asking why. How is a pmem
NUMA node any different from any other distant node in principle?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-27  3:41         ` Yang Shi
@ 2019-03-27 13:08           ` Keith Busch
  2019-03-27 17:00             ` Zi Yan
  2019-03-28 21:59             ` Yang Shi
  0 siblings, 2 replies; 66+ messages in thread
From: Keith Busch @ 2019-03-27 13:08 UTC (permalink / raw)
  To: Yang Shi
  Cc: mhocko, mgorman, riel, hannes, akpm, Hansen, Dave, Busch, Keith,
	Williams, Dan J, Wu, Fengguang, Du, Fan, Huang, Ying, linux-mm,
	linux-kernel

On Tue, Mar 26, 2019 at 08:41:15PM -0700, Yang Shi wrote:
> On 3/26/19 5:35 PM, Keith Busch wrote:
> > migration nodes have higher free capacity than source nodes. And since
> > your attempting THP's without ever splitting them, that also requires
> > lower fragmentation for a successful migration.
> 
> Yes, it is possible. However, migrate_pages() already has logic to 
> handle such case. If the target node has not enough space for migrating 
> THP in a whole, it would split THP then retry with base pages.

Oh, you're right, my mistake on splitting. So you have a good
best-effort migration, but I still think it can fail for legitimate
reasons that should have a swap fallback.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-27 13:08           ` Keith Busch
@ 2019-03-27 17:00             ` Zi Yan
  2019-03-27 17:05               ` Dave Hansen
  2019-03-28 21:59             ` Yang Shi
  1 sibling, 1 reply; 66+ messages in thread
From: Zi Yan @ 2019-03-27 17:00 UTC (permalink / raw)
  To: Keith Busch
  Cc: Yang Shi, mhocko, mgorman, riel, hannes, akpm, Hansen, Dave,
	Busch, Keith, Williams, Dan J, Wu, Fengguang, Du, Fan, Huang,
	Ying, linux-mm, linux-kernel

On 27 Mar 2019, at 6:08, Keith Busch wrote:

> On Tue, Mar 26, 2019 at 08:41:15PM -0700, Yang Shi wrote:
>> On 3/26/19 5:35 PM, Keith Busch wrote:
>>> migration nodes have higher free capacity than source nodes. And since
>>> your attempting THP's without ever splitting them, that also requires
>>> lower fragmentation for a successful migration.
>>
>> Yes, it is possible. However, migrate_pages() already has logic to
>> handle such case. If the target node has not enough space for migrating
>> THP in a whole, it would split THP then retry with base pages.
>
> Oh, you're right, my mistake on splitting. So you have a good best effort
> migrate, but I still think it can fail for legitimate reasons that should
> have a swap fallback.

Does this mean we might want to factor out the page reclaim code in shrink_page_list()
and call it for each page that fails to migrate to PMEM? Or do you still prefer
to migrate one page at a time, like what you did in your patch?

I ask this because I observe that migrating a list of pages can achieve higher
throughput compared to migrating pages individually. For example, migrating 512 4KB
pages can achieve ~750MB/s throughput, whereas migrating one 4KB page might only
achieve ~40MB/s throughput. The experiments were done on a two-socket machine
with two Xeon E5-2650 v3 @ 2.30GHz across the QPI link.
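
For reference, batching as a list could look roughly like the sketch
below (illustrative only, assuming the migrate_pages() signature of
this kernel generation; alloc_demote_page() and demote_page_list() are
made-up names):

static struct page *alloc_demote_page(struct page *page,
				      unsigned long node)
{
	return alloc_pages_node(node, GFP_HIGHUSER_MOVABLE, 0);
}

static void demote_page_list(struct list_head *demote_pages, int pmem_nid)
{
	if (list_empty(demote_pages))
		return;

	/* One batched call amortizes the per-page migration setup cost. */
	migrate_pages(demote_pages, alloc_demote_page, NULL,
		      pmem_nid, MIGRATE_ASYNC, MR_NUMA_MISPLACED);
}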


--
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-27 17:00             ` Zi Yan
@ 2019-03-27 17:05               ` Dave Hansen
  2019-03-27 17:48                 ` Zi Yan
  0 siblings, 1 reply; 66+ messages in thread
From: Dave Hansen @ 2019-03-27 17:05 UTC (permalink / raw)
  To: Zi Yan, Keith Busch
  Cc: Yang Shi, mhocko, mgorman, riel, hannes, akpm, Busch, Keith,
	Williams, Dan J, Wu, Fengguang, Du, Fan, Huang, Ying, linux-mm,
	linux-kernel

On 3/27/19 10:00 AM, Zi Yan wrote:
> I ask this because I observe that migrating a list of pages can
> achieve higher throughput compared to migrating individual page.
> For example, migrating 512 4KB pages can achieve ~750MB/s
> throughput, whereas migrating one 4KB page might only achieve
> ~40MB/s throughput. The experiments were done on a two-socket
> machine with two Xeon E5-2650 v3 @ 2.30GHz across the QPI link.

What kind of migration?

If you're talking about doing sys_migrate_pages() one page at a time,
that's a world away from doing something inside of the kernel one page
at a time.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-27  9:01         ` Michal Hocko
@ 2019-03-27 17:34             ` Dan Williams
  0 siblings, 0 replies; 66+ messages in thread
From: Dan Williams @ 2019-03-27 17:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yang Shi, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List

On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 26-03-19 19:58:56, Yang Shi wrote:
> >
> >
> > On 3/26/19 11:37 AM, Michal Hocko wrote:
> > > On Tue 26-03-19 11:33:17, Yang Shi wrote:
> > > >
> > > > On 3/26/19 6:58 AM, Michal Hocko wrote:
> > > > > On Sat 23-03-19 12:44:25, Yang Shi wrote:
> > > > > > With Dave Hansen's patches merged into Linus's tree
> > > > > >
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
> > > > > >
> > > > > > PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> > > > > > effectively and efficiently is still a question.
> > > > > >
> > > > > > There have been a couple of proposals posted on the mailing list [1] [2].
> > > > > >
> > > > > > The patchset is aimed to try a different approach from this proposal [1]
> > > > > > to use PMEM as NUMA nodes.
> > > > > >
> > > > > > The approach is designed to follow the below principles:
> > > > > >
> > > > > > 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
> > > > > >
> > > > > > 2. DRAM first/by default. No surprise to existing applications and default
> > > > > > running. PMEM will not be allocated unless its node is specified explicitly
> > > > > > by NUMA policy. Some applications may be not very sensitive to memory latency,
> > > > > > so they could be placed on PMEM nodes then have hot pages promote to DRAM
> > > > > > gradually.
> > > > > Why are you pushing yourself into the corner right at the beginning? If
> > > > > the PMEM is exported as a regular NUMA node then the only difference
> > > > > should be performance characteristics (module durability which shouldn't
> > > > > play any role in this particular case, right?). Applications which are
> > > > > already sensitive to memory access should better use proper binding already.
> > > > > Some NUMA topologies might have quite a large interconnect penalties
> > > > > already. So this doesn't sound like an argument to me, TBH.
> > > > The major rationale behind this is we assume the most applications should be
> > > > sensitive to memory access, particularly for meeting the SLA. The
> > > > applications run on the machine may be agnostic to us, they may be sensitive
> > > > or non-sensitive. But, assuming they are sensitive to memory access sounds
> > > > safer from SLA point of view. Then the "cold" pages could be demoted to PMEM
> > > > nodes by kernel's memory reclaim or other tools without impairing the SLA.
> > > >
> > > > If the applications are not sensitive to memory access, they could be bound
> > > > to PMEM or allowed to use PMEM (nice to have allocation on DRAM) explicitly,
> > > > then the "hot" pages could be promoted to DRAM.
> > > Again, how is this different from NUMA in general?
> >
> > It is still NUMA, users still can see all the NUMA nodes.
>
> No, Linux NUMA implementation makes all numa nodes available by default
> and provides an API to opt-in for more fine tuning. What you are
> suggesting goes against that semantic and I am asking why. How is pmem
> NUMA node any different from any any other distant node in principle?

Agree. It's just another NUMA node and shouldn't be special cased.
Userspace policy can choose to avoid it, but typical node distance
preference should otherwise let the kernel fall back to it as
additional memory pressure relief for "near" memory.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-27 17:05               ` Dave Hansen
@ 2019-03-27 17:48                 ` Zi Yan
  2019-03-27 18:00                   ` Dave Hansen
  0 siblings, 1 reply; 66+ messages in thread
From: Zi Yan @ 2019-03-27 17:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Keith Busch, Yang Shi, mhocko, mgorman, riel, hannes, akpm,
	Busch, Keith, Williams, Dan J, Wu, Fengguang, Du, Fan, Huang,
	Ying, linux-mm, linux-kernel

On 27 Mar 2019, at 10:05, Dave Hansen wrote:

> On 3/27/19 10:00 AM, Zi Yan wrote:
>> I ask this because I observe that migrating a list of pages can
>> achieve higher throughput compared to migrating individual page.
>> For example, migrating 512 4KB pages can achieve ~750MB/s
>> throughput, whereas migrating one 4KB page might only achieve
>> ~40MB/s throughput. The experiments were done on a two-socket
>> machine with two Xeon E5-2650 v3 @ 2.30GHz across the QPI link.
>
> What kind of migration?
>
> If you're talking about doing sys_migrate_pages() one page at a time,
> that's a world away from doing something inside of the kernel one page
> at a time.

For 40MB/s vs 750MB/s, they were using sys_migrate_pages(). Sorry about
the confusion there. When I measure only the migrate_pages() call in the
kernel, the throughput becomes:
migrating one 4KB page: 0.312GB/s vs migrating 512 4KB pages: 0.854GB/s.
That is still a >2x difference.

Furthermore, if we only consider migrate_page_copy() in mm/migrate.c,
which only calls copy_highpage() and migrate_page_states(), the throughput
becomes:
migrating one 4KB page: 1.385GB/s vs migrating 512 4KB pages: 1.983GB/s.
The gap is smaller, but migrating 512 4KB pages still achieves ~40% more
throughput.

Do these numbers make sense to you?

--
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-27 17:48                 ` Zi Yan
@ 2019-03-27 18:00                   ` Dave Hansen
  2019-03-27 20:37                     ` Zi Yan
  0 siblings, 1 reply; 66+ messages in thread
From: Dave Hansen @ 2019-03-27 18:00 UTC (permalink / raw)
  To: Zi Yan
  Cc: Keith Busch, Yang Shi, mhocko, mgorman, riel, hannes, akpm,
	Busch, Keith, Williams, Dan J, Wu, Fengguang, Du, Fan, Huang,
	Ying, linux-mm, linux-kernel

On 3/27/19 10:48 AM, Zi Yan wrote:
> For 40MB/s vs 750MB/s, they were using sys_migrate_pages(). Sorry
> about the confusion there. As I measure only the migrate_pages() in
> the kernel, the throughput becomes: migrating 4KB page: 0.312GB/s
> vs migrating 512 4KB pages: 0.854GB/s. They are still >2x
> difference.
> 
> Furthermore, if we only consider the migrate_page_copy() in
> mm/migrate.c, which only calls copy_highpage() and
> migrate_page_states(), the throughput becomes: migrating 4KB page:
> 1.385GB/s vs migrating 512 4KB pages: 1.983GB/s. The gap is
> smaller, but migrating 512 4KB pages still achieves 40% more 
> throughput.
> 
> Do these numbers make sense to you?

Yes.  It would be very interesting to batch the migrations in the
kernel and see how it affects the code.  A 50% boost is interesting,
but not if it's only in microbenchmarks and takes 2k lines of code.

50% is *very* interesting if it happens in the real world and we can
do it in 10 lines of code.

So, let's see what the code looks like.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-27 17:34             ` Dan Williams
  (?)
@ 2019-03-27 18:59             ` Yang Shi
  2019-03-27 20:09               ` Michal Hocko
  2019-03-27 20:14               ` Dave Hansen
  -1 siblings, 2 replies; 66+ messages in thread
From: Yang Shi @ 2019-03-27 18:59 UTC (permalink / raw)
  To: Dan Williams, Michal Hocko
  Cc: Mel Gorman, Rik van Riel, Johannes Weiner, Andrew Morton,
	Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan, Huang, Ying,
	Linux MM, Linux Kernel Mailing List



On 3/27/19 10:34 AM, Dan Williams wrote:
> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
>> On Tue 26-03-19 19:58:56, Yang Shi wrote:
>>>
>>> On 3/26/19 11:37 AM, Michal Hocko wrote:
>>>> On Tue 26-03-19 11:33:17, Yang Shi wrote:
>>>>> On 3/26/19 6:58 AM, Michal Hocko wrote:
>>>>>> On Sat 23-03-19 12:44:25, Yang Shi wrote:
>>>>>>> With Dave Hansen's patches merged into Linus's tree
>>>>>>>
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
>>>>>>>
>>>>>>> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
>>>>>>> effectively and efficiently is still a question.
>>>>>>>
>>>>>>> There have been a couple of proposals posted on the mailing list [1] [2].
>>>>>>>
>>>>>>> The patchset is aimed to try a different approach from this proposal [1]
>>>>>>> to use PMEM as NUMA nodes.
>>>>>>>
>>>>>>> The approach is designed to follow the below principles:
>>>>>>>
>>>>>>> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
>>>>>>>
>>>>>>> 2. DRAM first/by default. No surprise to existing applications and default
>>>>>>> running. PMEM will not be allocated unless its node is specified explicitly
>>>>>>> by NUMA policy. Some applications may be not very sensitive to memory latency,
>>>>>>> so they could be placed on PMEM nodes then have hot pages promote to DRAM
>>>>>>> gradually.
>>>>>> Why are you pushing yourself into the corner right at the beginning? If
>>>>>> the PMEM is exported as a regular NUMA node then the only difference
>>>>>> should be performance characteristics (module durability which shouldn't
>>>>>> play any role in this particular case, right?). Applications which are
>>>>>> already sensitive to memory access should better use proper binding already.
>>>>>> Some NUMA topologies might have quite a large interconnect penalties
>>>>>> already. So this doesn't sound like an argument to me, TBH.
>>>>> The major rationale behind this is we assume the most applications should be
>>>>> sensitive to memory access, particularly for meeting the SLA. The
>>>>> applications run on the machine may be agnostic to us, they may be sensitive
>>>>> or non-sensitive. But, assuming they are sensitive to memory access sounds
>>>>> safer from SLA point of view. Then the "cold" pages could be demoted to PMEM
>>>>> nodes by kernel's memory reclaim or other tools without impairing the SLA.
>>>>>
>>>>> If the applications are not sensitive to memory access, they could be bound
>>>>> to PMEM or allowed to use PMEM (nice to have allocation on DRAM) explicitly,
>>>>> then the "hot" pages could be promoted to DRAM.
>>>> Again, how is this different from NUMA in general?
>>> It is still NUMA, users still can see all the NUMA nodes.
>> No, Linux NUMA implementation makes all numa nodes available by default
>> and provides an API to opt-in for more fine tuning. What you are
>> suggesting goes against that semantic and I am asking why. How is pmem
>> NUMA node any different from any any other distant node in principle?
> Agree. It's just another NUMA node and shouldn't be special cased.
> Userspace policy can choose to avoid it, but typical node distance
> preference should otherwise let the kernel fall back to it as
> additional memory pressure relief for "near" memory.

In the ideal case, yes, I agree. However, in the real world performance
is a concern. It is well known that PMEM (not considering NVDIMM-F or
HBM) has higher latency and lower bandwidth than DRAM. We observed much
higher latency on PMEM than on DRAM with multiple threads.

In a real production environment we don't know what kind of applications
would end up on PMEM (DRAM may be full, so allocations fall back to PMEM)
and then see unexpected performance degradation. I understand a mempolicy
could be used to avoid it. But there might be hundreds or thousands of
applications running on the machine, so it does not sound feasible to me
to have every single application set a mempolicy to avoid it.

So, I think we still need a default allocation node mask. The default
value may include all nodes or just DRAM nodes. But it should be possible
for the user to override it globally, not only on a per-process basis.
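To make that concrete, here is a minimal sketch of what such a default
mask could look like; the names below are made up for illustration and
are not the actual patch code:

/*
 * Illustrative sketch only: a global default allocation mask that an
 * administrator could override (e.g. via a sysctl), while an explicit
 * binding still wins for individual processes.
 */
nodemask_t default_alloc_nodes = NODE_MASK_ALL;

static nodemask_t *pick_alloc_nodes(nodemask_t *policy_nodes)
{
        /* An explicit mempolicy/cpuset binding takes precedence. */
        if (policy_nodes)
                return policy_nodes;

        /* Otherwise restrict default allocations to the global mask. */
        return &default_alloc_nodes;
}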

Due to the performance disparity, our current use cases treat PMEM as
second-tier memory for demoting cold pages or for binding applications
that are not sensitive to memory access latency (this is the reason for
inventing a new mempolicy), although it is a NUMA node.

Thanks,
Yang



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-27 18:59             ` Yang Shi
@ 2019-03-27 20:09               ` Michal Hocko
  2019-03-28  2:09                 ` Yang Shi
  2019-03-27 20:14               ` Dave Hansen
  1 sibling, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2019-03-27 20:09 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dan Williams, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List

On Wed 27-03-19 11:59:28, Yang Shi wrote:
> 
> 
> On 3/27/19 10:34 AM, Dan Williams wrote:
> > On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > On Tue 26-03-19 19:58:56, Yang Shi wrote:
[...]
> > > > It is still NUMA, users still can see all the NUMA nodes.
> > > No, Linux NUMA implementation makes all numa nodes available by default
> > > and provides an API to opt-in for more fine tuning. What you are
> > > suggesting goes against that semantic and I am asking why. How is pmem
> > > NUMA node any different from any any other distant node in principle?
> > Agree. It's just another NUMA node and shouldn't be special cased.
> > Userspace policy can choose to avoid it, but typical node distance
> > preference should otherwise let the kernel fall back to it as
> > additional memory pressure relief for "near" memory.
> 
> In ideal case, yes, I agree. However, in real life world the performance is
> a concern. It is well-known that PMEM (not considering NVDIMM-F or HBM) has
> higher latency and lower bandwidth. We observed much higher latency on PMEM
> than DRAM with multi threads.

One rule of thumb is: Do not design user-visible interfaces based on the
contemporary technology and its up/down sides. This will almost always
backfire.

Btw. you keep arguing about performance without any numbers. Can you
present something specific?

> In real production environment we don't know what kind of applications would
> end up on PMEM (DRAM may be full, allocation fall back to PMEM) then have
> unexpected performance degradation. I understand to have mempolicy to choose
> to avoid it. But, there might be hundreds or thousands of applications
> running on the machine, it sounds not that feasible to me to have each
> single application set mempolicy to avoid it.

we have cpuset cgroup controller to help here.
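For instance, a management agent could confine whole groups of tasks to
the DRAM nodes with something like the sketch below, after moving the
tasks into that cpuset; the cgroup path and node numbers are made up for
illustration:

/*
 * Sketch: restrict every task in a cpuset to nodes 0-1 (assumed to be
 * the DRAM nodes). The cgroup v1 path and node numbers are illustrative.
 */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/fs/cgroup/cpuset/dram_only/cpuset.mems", "w");

        if (!f) {
                perror("cpuset.mems");
                return 1;
        }
        fprintf(f, "0-1\n");    /* allocations allowed only from nodes 0-1 */
        fclose(f);
        return 0;
}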

> So, I think we still need a default allocation node mask. The default value
> may include all nodes or just DRAM nodes. But, they should be able to be
> override by user globally, not only per process basis.
> 
> Due to the performance disparity, currently our usecases treat PMEM as
> second tier memory for demoting cold page or binding to not memory access
> sensitive applications (this is the reason for inventing a new mempolicy)
> although it is a NUMA node.

If the performance sucks that badly then do not use the pmem as NUMA,
really. There are certainly other ways to export the pmem storage. Use
it as fast swap storage. Or try to work on a swap caching mechanism
that still allows much faster access than a slow swap storage. But do
not abuse the NUMA interface while breaking some of its long
established semantics.
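Using it as swap is straightforward; as a sketch, where /dev/pmem0 is an
assumed device name and mkswap must have been run on it first:

/*
 * Sketch: enable a pmem block device as swap space. The device path is
 * an assumption for illustration; requires root and a prior mkswap.
 */
#include <stdio.h>
#include <sys/swap.h>

int main(void)
{
        if (swapon("/dev/pmem0", 0) != 0) {
                perror("swapon /dev/pmem0");
                return 1;
        }
        return 0;
}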
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-27 18:59             ` Yang Shi
  2019-03-27 20:09               ` Michal Hocko
@ 2019-03-27 20:14               ` Dave Hansen
  1 sibling, 0 replies; 66+ messages in thread
From: Dave Hansen @ 2019-03-27 20:14 UTC (permalink / raw)
  To: Yang Shi, Dan Williams, Michal Hocko
  Cc: Mel Gorman, Rik van Riel, Johannes Weiner, Andrew Morton,
	Keith Busch, Fengguang Wu, Du, Fan, Huang, Ying, Linux MM,
	Linux Kernel Mailing List

On 3/27/19 11:59 AM, Yang Shi wrote:
> In real production environment we don't know what kind of applications
> would end up on PMEM (DRAM may be full, allocation fall back to PMEM)
> then have unexpected performance degradation. I understand to have
> mempolicy to choose to avoid it. But, there might be hundreds or
> thousands of applications running on the machine, it sounds not that
> feasible to me to have each single application set mempolicy to avoid it.

Maybe not manually, but it's entirely possible to automate this.

It would be trivial to get help from an orchestrator, or even systemd to
get apps launched with a particular policy.  Or, even a *shell* that
launches apps to have a particular policy.
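As a rough sketch of such a launcher (node numbers are made up; this is
roughly what numactl --membind does), a tiny wrapper can set the policy
and then exec the real application, since the task mempolicy is preserved
across execve():

/*
 * Sketch: bind the launched application to assumed DRAM nodes 0-1.
 * Node numbers are illustrative; build with -lnuma for <numaif.h>.
 */
#include <numaif.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        unsigned long dram_nodes = (1UL << 0) | (1UL << 1);

        if (argc < 2) {
                fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
                return 1;
        }
        if (set_mempolicy(MPOL_BIND, &dram_nodes, 8 * sizeof(dram_nodes)))
                perror("set_mempolicy");

        execvp(argv[1], &argv[1]);
        perror("execvp");
        return 1;
}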


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-27 17:34             ` Dan Williams
  (?)
  (?)
@ 2019-03-27 20:35             ` Matthew Wilcox
  2019-03-27 20:40               ` Dave Hansen
  -1 siblings, 1 reply; 66+ messages in thread
From: Matthew Wilcox @ 2019-03-27 20:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: Michal Hocko, Yang Shi, Mel Gorman, Rik van Riel,
	Johannes Weiner, Andrew Morton, Dave Hansen, Keith Busch,
	Fengguang Wu, Du, Fan, Huang, Ying, Linux MM,
	Linux Kernel Mailing List

On Wed, Mar 27, 2019 at 10:34:11AM -0700, Dan Williams wrote:
> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
> > No, Linux NUMA implementation makes all numa nodes available by default
> > and provides an API to opt-in for more fine tuning. What you are
> > suggesting goes against that semantic and I am asking why. How is pmem
> > NUMA node any different from any any other distant node in principle?
> 
> Agree. It's just another NUMA node and shouldn't be special cased.
> Userspace policy can choose to avoid it, but typical node distance
> preference should otherwise let the kernel fall back to it as
> additional memory pressure relief for "near" memory.

I think this is sort of true, but sort of different.  These are
essentially CPU-less nodes; there is no CPU for which they are
fast memory.  Yes, they're further from some CPUs than from others.
I have never paid attention to how Linux treats CPU-less memory nodes,
but it would make sense to me if we don't default to allocating from
remote nodes.  And treating pmem nodes as being remote from all CPUs
makes a certain amount of sense to me.

eg on a four CPU-socket system, consider this as being

pmem1 --- node1 --- node2 --- pmem2
            |   \ /   |
            |    X    |
            |   / \   |
pmem3 --- node3 --- node4 --- pmem4

which I could actually see someone building with normal DRAM, and we
should probably handle it the same way as pmem; for a process running on
node3, allocate preferentially from node3, then pmem3, then other nodes,
then other pmems.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-27 18:00                   ` Dave Hansen
@ 2019-03-27 20:37                     ` Zi Yan
  2019-03-27 20:42                       ` Dave Hansen
  0 siblings, 1 reply; 66+ messages in thread
From: Zi Yan @ 2019-03-27 20:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Keith Busch, Yang Shi, mhocko, mgorman, riel, hannes, akpm,
	Busch, Keith, Williams, Dan J, Wu, Fengguang, Du, Fan, Huang,
	Ying, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3204 bytes --]

On 27 Mar 2019, at 11:00, Dave Hansen wrote:

> On 3/27/19 10:48 AM, Zi Yan wrote:
>> For 40MB/s vs 750MB/s, they were using sys_migrate_pages(). Sorry
>> about the confusion there. As I measure only the migrate_pages() in
>> the kernel, the throughput becomes: migrating 4KB page: 0.312GB/s
>> vs migrating 512 4KB pages: 0.854GB/s. They are still >2x
>> difference.
>>
>> Furthermore, if we only consider the migrate_page_copy() in
>> mm/migrate.c, which only calls copy_highpage() and
>> migrate_page_states(), the throughput becomes: migrating 4KB page:
>> 1.385GB/s vs migrating 512 4KB pages: 1.983GB/s. The gap is
>> smaller, but migrating 512 4KB pages still achieves 40% more
>> throughput.
>>
>> Do these numbers make sense to you?
>
> Yes.  It would be very interesting to batch the migrations in the
> kernel and see how it affects the code.  A 50% boost is interesting,
> but not if it's only in microbenchmarks and takes 2k lines of code.
>
> 50% is *very* interesting if it happens in the real world and we can
> do it in 10 lines of code.
>
> So, let's see what the code looks like.

Actually, the migration throughput difference does not come from any kernel
changes, it is a pure comparison between migrate_pages(single 4KB page) and
migrate_pages(a list of 4KB pages). The point I wanted to make is that
Yang’s approach, which migrates a list of pages at the end of shrink_page_list(),
can achieve higher throughput than Keith’s approach, which migrates one page
at a time in the while loop inside shrink_page_list().
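In rough pseudo-kernel-C, the two shapes being compared look something
like the following. This is only a sketch, not the actual patches;
should_demote(), alloc_demote_page(), MR_DEMOTION and target_nid are
made-up names for illustration:

        struct page *page, *next;
        LIST_HEAD(single);
        LIST_HEAD(demote_list);

        /* Per-page (Keith's shape, roughly): one migrate_pages() call
         * per demoted page, inside the shrink_page_list() loop. */
        list_for_each_entry_safe(page, next, page_list, lru) {
                if (should_demote(page)) {
                        list_move(&page->lru, &single);
                        migrate_pages(&single, alloc_demote_page, NULL,
                                      target_nid, MIGRATE_ASYNC, MR_DEMOTION);
                }
        }

        /* Batched (Yang's shape, roughly): collect candidates first, then
         * a single migrate_pages() call at the end of shrink_page_list(),
         * so the per-call setup cost is amortized over the whole list. */
        list_for_each_entry_safe(page, next, page_list, lru) {
                if (should_demote(page))
                        list_move(&page->lru, &demote_list);
        }
        migrate_pages(&demote_list, alloc_demote_page, NULL,
                      target_nid, MIGRATE_ASYNC, MR_DEMOTION);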

In addition to the above, migrating a single THP can get us even higher throughput.
Here are the throughput numbers comparing all three cases:
                             |  migrate_pages() |  migrate_page_copy()
migrating single 4KB page:   |  0.312GB/s       |  1.385GB/s
migrating 512 4KB pages:     |  0.854GB/s       |  1.983GB/s
migrating single 2MB THP:    |  2.387GB/s       |  2.481GB/s

Obviously, we would like to migrate THPs as a whole instead of 512 4KB pages
individually. Of course, this assumes we have free space in PMEM for THPs and
all subpages in the THPs are cold.


To batch the migration, I posted some code a while ago: https://lwn.net/Articles/714991/,
which shows a good throughput improvement when microbenchmarking sys_migrate_pages().
It also included using multiple threads to copy a page, aggregating multiple migrate_page_copy()
calls, and even using DMA instead of CPUs to copy data. We could revisit the code if necessary.

In terms of end-to-end results, I also have some results from my paper:
http://www.cs.yale.edu/homes/abhishek/ziyan-asplos19.pdf (Figures 8 to 11 show the
microbenchmark results and Figure 12 shows end-to-end results). I basically called
shrink_active/inactive_list() every 5 seconds to track page hotness and used all my page
migration optimizations above, which gives a 40% application runtime speedup on average.
The experiments were done on a two-socket NUMA machine where one node was slowed down to
half the bandwidth and twice the access latency of the other node. I can discuss it
more if you are interested.


--
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-27 20:35             ` Matthew Wilcox
@ 2019-03-27 20:40               ` Dave Hansen
  0 siblings, 0 replies; 66+ messages in thread
From: Dave Hansen @ 2019-03-27 20:40 UTC (permalink / raw)
  To: Matthew Wilcox, Dan Williams
  Cc: Michal Hocko, Yang Shi, Mel Gorman, Rik van Riel,
	Johannes Weiner, Andrew Morton, Keith Busch, Fengguang Wu, Du,
	Fan, Huang, Ying, Linux MM, Linux Kernel Mailing List

On 3/27/19 1:35 PM, Matthew Wilcox wrote:
> 
> pmem1 --- node1 --- node2 --- pmem2
>             |   \ /   |
>             |    X    |
>             |   / \   |
> pmem3 --- node3 --- node4 --- pmem4
> 
> which I could actually see someone building with normal DRAM, and we
> should probably handle the same way as pmem; for a process running on
> node3, allocate preferentially from node3, then pmem3, then other nodes,
> then other pmems.

That makes sense.  But, it might _also_ make sense to fill up all DRAM
first before using any pmem.  That could happen if the NUMA interconnect
is really fast and pmem is really slow.

Basically, with the current patches we are depending on the firmware to
"nicely" enumerate the topology and we're keeping the behavior that we
end up with, for now, whatever it might be.

Now, let's sit back and see how nice the firmware is. :)
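One easy way to see what the firmware actually handed us is to dump the
SLIT distances from sysfs; a small sketch (stopping at the first missing
node is a simplification):

/* Sketch: print the NUMA distance table reported by the firmware. */
#include <stdio.h>

int main(void)
{
        char path[64], buf[256];

        for (int nid = 0; nid < 64; nid++) {    /* arbitrary upper bound */
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/devices/system/node/node%d/distance", nid);
                f = fopen(path, "r");
                if (!f)
                        break;                  /* assume no more nodes */
                if (fgets(buf, sizeof(buf), f))
                        printf("node%d: %s", nid, buf);
                fclose(f);
        }
        return 0;
}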

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-27 20:37                     ` Zi Yan
@ 2019-03-27 20:42                       ` Dave Hansen
  0 siblings, 0 replies; 66+ messages in thread
From: Dave Hansen @ 2019-03-27 20:42 UTC (permalink / raw)
  To: Zi Yan
  Cc: Keith Busch, Yang Shi, mhocko, mgorman, riel, hannes, akpm,
	Busch, Keith, Williams, Dan J, Wu, Fengguang, Du, Fan, Huang,
	Ying, linux-mm, linux-kernel

On 3/27/19 1:37 PM, Zi Yan wrote:
> Actually, the migration throughput difference does not come from
> any kernel changes, it is a pure comparison between
> migrate_pages(single 4KB page) and migrate_pages(a list of 4KB
> pages). The point I wanted to make is that Yang’s approach, which
> migrates a list of pages at the end of shrink_page_list(), can
> achieve higher throughput than Keith’s approach, which migrates one
> page at a time in the while loop inside shrink_page_list().

I look forward to seeing the patches.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-27 20:09               ` Michal Hocko
@ 2019-03-28  2:09                 ` Yang Shi
  2019-03-28  6:58                   ` Michal Hocko
  2019-03-28  8:21                     ` Dan Williams
  0 siblings, 2 replies; 66+ messages in thread
From: Yang Shi @ 2019-03-28  2:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Williams, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List



On 3/27/19 1:09 PM, Michal Hocko wrote:
> On Wed 27-03-19 11:59:28, Yang Shi wrote:
>>
>> On 3/27/19 10:34 AM, Dan Williams wrote:
>>> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
>>>> On Tue 26-03-19 19:58:56, Yang Shi wrote:
> [...]
>>>>> It is still NUMA, users still can see all the NUMA nodes.
>>>> No, Linux NUMA implementation makes all numa nodes available by default
>>>> and provides an API to opt-in for more fine tuning. What you are
>>>> suggesting goes against that semantic and I am asking why. How is pmem
>>>> NUMA node any different from any any other distant node in principle?
>>> Agree. It's just another NUMA node and shouldn't be special cased.
>>> Userspace policy can choose to avoid it, but typical node distance
>>> preference should otherwise let the kernel fall back to it as
>>> additional memory pressure relief for "near" memory.
>> In ideal case, yes, I agree. However, in real life world the performance is
>> a concern. It is well-known that PMEM (not considering NVDIMM-F or HBM) has
>> higher latency and lower bandwidth. We observed much higher latency on PMEM
>> than DRAM with multi threads.
> One rule of thumb is: Do not design user visible interfaces based on the
> contemporary technology and its up/down sides. This will almost always
> fire back.

Thanks. It does make sense to me.

>
> Btw. if you keep arguing about performance without any numbers. Can you
> present something specific?

Yes, I did have some numbers. We did a simple sequential memory read/write
latency test with an in-house test program, bound to PMEM and to DRAM
respectively. When running with 20 threads the results are as below:

             Threads     w/lat      r/lat
PMEM         20          537.15     68.06
DRAM         20          14.19      6.47

And, sysbench test with command: sysbench --time=600 memory 
--memory-block-size=8G --memory-total-size=1024T --memory-scope=global 
--memory-oper=read --memory-access-mode=rnd --rand-type=gaussian 
--rand-pareto-h=0.1 --threads=1 run

The result is:
                    lat/ms
PMEM      103766.09
DRAM      31946.30

>
>> In real production environment we don't know what kind of applications would
>> end up on PMEM (DRAM may be full, allocation fall back to PMEM) then have
>> unexpected performance degradation. I understand to have mempolicy to choose
>> to avoid it. But, there might be hundreds or thousands of applications
>> running on the machine, it sounds not that feasible to me to have each
>> single application set mempolicy to avoid it.
> we have cpuset cgroup controller to help here.
>
>> So, I think we still need a default allocation node mask. The default value
>> may include all nodes or just DRAM nodes. But, they should be able to be
>> override by user globally, not only per process basis.
>>
>> Due to the performance disparity, currently our usecases treat PMEM as
>> second tier memory for demoting cold page or binding to not memory access
>> sensitive applications (this is the reason for inventing a new mempolicy)
>> although it is a NUMA node.
> If the performance sucks that badly then do not use the pmem as NUMA,
> really. There are certainly other ways to export the pmem storage. Use
> it as a fast swap storage. Or try to work on a swap caching mechanism
> that still allows much faster access than a slow swap storage. But do
> not try to pretend to abuse the NUMA interface while you are breaking
> some of its long term established semantics.

Yes, we are looking into using it as fast swap storage too, and perhaps
other use cases.

Anyway, since nobody else thinks it makes sense to restrict the default
allocation nodes, and it does sound over-engineered, I'm going to drop it.

One question: when doing demotion and promotion we need to define a path,
for example DRAM <-> PMEM (assuming two-tier memory). When determining
which nodes are "DRAM" nodes, does it make sense to assume that nodes with
both CPU and memory are DRAM nodes, since PMEM nodes are typically cpuless?

Thanks,
Yang



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-28  2:09                 ` Yang Shi
@ 2019-03-28  6:58                   ` Michal Hocko
  2019-03-28 18:58                     ` Yang Shi
  2019-03-28  8:21                     ` Dan Williams
  1 sibling, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2019-03-28  6:58 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dan Williams, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List

On Wed 27-03-19 19:09:10, Yang Shi wrote:
> One question, when doing demote and promote we need define a path, for
> example, DRAM <-> PMEM (assume two tier memory). When determining what nodes
> are "DRAM" nodes, does it make sense to assume the nodes with both cpu and
> memory are DRAM nodes since PMEM nodes are typically cpuless nodes?

Do we really have to special case this for PMEM? Why cannot we simply go
in the zonelist order? In other words why cannot we use the same logic
for a larger NUMA machine and instead of swapping simply fallback to a
less contended NUMA node? It can be a regular DRAM, PMEM or whatever
other type of memory node.
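Something like the following sketch would express that: pick the fallback
target purely by node distance, with no PMEM-specific knowledge.
pick_demotion_target() is a made-up name for illustration, and a real
version would also have to consider how contended the candidate is:

/* Sketch: choose a fallback target by distance only, not by memory type. */
static int pick_demotion_target(int src_nid)
{
        int nid, best = NUMA_NO_NODE, best_dist = INT_MAX;

        for_each_node_state(nid, N_MEMORY) {
                if (nid == src_nid)
                        continue;
                if (node_distance(src_nid, nid) < best_dist) {
                        best_dist = node_distance(src_nid, nid);
                        best = nid;
                }
        }
        return best;
}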
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-28  2:09                 ` Yang Shi
@ 2019-03-28  8:21                     ` Dan Williams
  2019-03-28  8:21                     ` Dan Williams
  1 sibling, 0 replies; 66+ messages in thread
From: Dan Williams @ 2019-03-28  8:21 UTC (permalink / raw)
  To: Yang Shi
  Cc: Michal Hocko, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List

On Wed, Mar 27, 2019 at 7:09 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
> On 3/27/19 1:09 PM, Michal Hocko wrote:
> > On Wed 27-03-19 11:59:28, Yang Shi wrote:
> >>
> >> On 3/27/19 10:34 AM, Dan Williams wrote:
> >>> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
> >>>> On Tue 26-03-19 19:58:56, Yang Shi wrote:
> > [...]
> >>>>> It is still NUMA, users still can see all the NUMA nodes.
> >>>> No, Linux NUMA implementation makes all numa nodes available by default
> >>>> and provides an API to opt-in for more fine tuning. What you are
> >>>> suggesting goes against that semantic and I am asking why. How is pmem
> >>>> NUMA node any different from any any other distant node in principle?
> >>> Agree. It's just another NUMA node and shouldn't be special cased.
> >>> Userspace policy can choose to avoid it, but typical node distance
> >>> preference should otherwise let the kernel fall back to it as
> >>> additional memory pressure relief for "near" memory.
> >> In ideal case, yes, I agree. However, in real life world the performance is
> >> a concern. It is well-known that PMEM (not considering NVDIMM-F or HBM) has
> >> higher latency and lower bandwidth. We observed much higher latency on PMEM
> >> than DRAM with multi threads.
> > One rule of thumb is: Do not design user visible interfaces based on the
> > contemporary technology and its up/down sides. This will almost always
> > fire back.
>
> Thanks. It does make sense to me.
>
> >
> > Btw. if you keep arguing about performance without any numbers. Can you
> > present something specific?
>
> Yes, I did have some numbers. We did simple memory sequential rw latency
> test with a designed-in-house test program on PMEM (bind to PMEM) and
> DRAM (bind to DRAM). When running with 20 threads the result is as below:
>
>               Threads          w/lat            r/lat
> PMEM      20                537.15         68.06
> DRAM      20                14.19           6.47
>
> And, sysbench test with command: sysbench --time=600 memory
> --memory-block-size=8G --memory-total-size=1024T --memory-scope=global
> --memory-oper=read --memory-access-mode=rnd --rand-type=gaussian
> --rand-pareto-h=0.1 --threads=1 run
>
> The result is:
>                     lat/ms
> PMEM      103766.09
> DRAM      31946.30
>
> >
> >> In real production environment we don't know what kind of applications would
> >> end up on PMEM (DRAM may be full, allocation fall back to PMEM) then have
> >> unexpected performance degradation. I understand to have mempolicy to choose
> >> to avoid it. But, there might be hundreds or thousands of applications
> >> running on the machine, it sounds not that feasible to me to have each
> >> single application set mempolicy to avoid it.
> > we have cpuset cgroup controller to help here.
> >
> >> So, I think we still need a default allocation node mask. The default value
> >> may include all nodes or just DRAM nodes. But, they should be able to be
> >> override by user globally, not only per process basis.
> >>
> >> Due to the performance disparity, currently our usecases treat PMEM as
> >> second tier memory for demoting cold page or binding to not memory access
> >> sensitive applications (this is the reason for inventing a new mempolicy)
> >> although it is a NUMA node.
> > If the performance sucks that badly then do not use the pmem as NUMA,
> > really. There are certainly other ways to export the pmem storage. Use
> > it as a fast swap storage. Or try to work on a swap caching mechanism
> > that still allows much faster access than a slow swap storage. But do
> > not try to pretend to abuse the NUMA interface while you are breaking
> > some of its long term established semantics.
>
> Yes, we are looking into using it as a fast swap storage too and perhaps
> other usecases.
>
> Anyway, though nobody thought it makes sense to restrict default
> allocation nodes, it sounds over-engineered. I'm going to drop it.
>
> One question, when doing demote and promote we need define a path, for
> example, DRAM <-> PMEM (assume two tier memory). When determining what
> nodes are "DRAM" nodes, does it make sense to assume the nodes with both
> cpu and memory are DRAM nodes since PMEM nodes are typically cpuless nodes?

For ACPI platforms the HMAT is effectively going to enforce "cpu-less"
nodes for any memory range that has differentiated performance from
the conventional memory pool, or differentiated performance for a
specific initiator. So "cpu-less == PMEM" is not a robust
assumption.

The plan is to use the HMAT to populate the default fallback order,
but allow for an override if the HMAT information is missing or
incorrect.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-28  6:58                   ` Michal Hocko
@ 2019-03-28 18:58                     ` Yang Shi
  2019-03-28 19:12                       ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Yang Shi @ 2019-03-28 18:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Williams, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List



On 3/27/19 11:58 PM, Michal Hocko wrote:
> On Wed 27-03-19 19:09:10, Yang Shi wrote:
>> One question, when doing demote and promote we need define a path, for
>> example, DRAM <-> PMEM (assume two tier memory). When determining what nodes
>> are "DRAM" nodes, does it make sense to assume the nodes with both cpu and
>> memory are DRAM nodes since PMEM nodes are typically cpuless nodes?
> Do we really have to special case this for PMEM? Why cannot we simply go
> in the zonelist order? In other words why cannot we use the same logic
> for a larger NUMA machine and instead of swapping simply fallback to a
> less contended NUMA node? It can be a regular DRAM, PMEM or whatever
> other type of memory node.

Thanks for the suggestion. It makes sense. However, if we don't
special-case a PMEM node, its fallback node may be a DRAM node; memory
reclaim may then move inactive pages to the DRAM node, which does not
make much sense since reclaim would prefer to move pages downwards
(DRAM -> PMEM -> disk).

Yang



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-28 18:58                     ` Yang Shi
@ 2019-03-28 19:12                       ` Michal Hocko
  2019-03-28 19:40                         ` Yang Shi
  0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2019-03-28 19:12 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dan Williams, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List

On Thu 28-03-19 11:58:57, Yang Shi wrote:
> 
> 
> On 3/27/19 11:58 PM, Michal Hocko wrote:
> > On Wed 27-03-19 19:09:10, Yang Shi wrote:
> > > One question, when doing demote and promote we need define a path, for
> > > example, DRAM <-> PMEM (assume two tier memory). When determining what nodes
> > > are "DRAM" nodes, does it make sense to assume the nodes with both cpu and
> > > memory are DRAM nodes since PMEM nodes are typically cpuless nodes?
> > Do we really have to special case this for PMEM? Why cannot we simply go
> > in the zonelist order? In other words why cannot we use the same logic
> > for a larger NUMA machine and instead of swapping simply fallback to a
> > less contended NUMA node? It can be a regular DRAM, PMEM or whatever
> > other type of memory node.
> 
> Thanks for the suggestion. It makes sense. However, if we don't specialize a
> pmem node, its fallback node may be a DRAM node, then the memory reclaim may
> move the inactive page to the DRAM node, it sounds not make too much sense
> since memory reclaim would prefer to move downwards (DRAM -> PMEM -> Disk).

There are certainly many details to sort out. One thing is how to handle
cpuless nodes (e.g. PMEM). Those shouldn't get any direct allocations
without an explicit binding, right? My first naive idea would be to
migrate-on-reclaim only from the preferred node. We might need
additional heuristics but I wouldn't special case PMEM from other
cpuless NUMA nodes.
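
To make that concrete, the check could simply be "does the node have
CPUs"; something like this sketch, where top_tier_nodes and
build_top_tier_mask() are made-up names for illustration:

/* Sketch: classify nodes by "has CPUs", not by memory type. */
static nodemask_t top_tier_nodes;

static void build_top_tier_mask(void)
{
        int nid;

        nodes_clear(top_tier_nodes);
        for_each_node_state(nid, N_MEMORY)
                if (node_state(nid, N_CPU))
                        node_set(nid, top_tier_nodes);
        /* Every node not in top_tier_nodes is only a fallback/demotion
         * target, whether it is PMEM or any other cpuless memory node. */
}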
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-28 19:12                       ` Michal Hocko
@ 2019-03-28 19:40                         ` Yang Shi
  2019-03-28 20:40                           ` Michal Hocko
  0 siblings, 1 reply; 66+ messages in thread
From: Yang Shi @ 2019-03-28 19:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Williams, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List



On 3/28/19 12:12 PM, Michal Hocko wrote:
> On Thu 28-03-19 11:58:57, Yang Shi wrote:
>>
>> On 3/27/19 11:58 PM, Michal Hocko wrote:
>>> On Wed 27-03-19 19:09:10, Yang Shi wrote:
>>>> One question, when doing demote and promote we need define a path, for
>>>> example, DRAM <-> PMEM (assume two tier memory). When determining what nodes
>>>> are "DRAM" nodes, does it make sense to assume the nodes with both cpu and
>>>> memory are DRAM nodes since PMEM nodes are typically cpuless nodes?
>>> Do we really have to special case this for PMEM? Why cannot we simply go
>>> in the zonelist order? In other words why cannot we use the same logic
>>> for a larger NUMA machine and instead of swapping simply fallback to a
>>> less contended NUMA node? It can be a regular DRAM, PMEM or whatever
>>> other type of memory node.
>> Thanks for the suggestion. It makes sense. However, if we don't specialize a
>> pmem node, its fallback node may be a DRAM node, then the memory reclaim may
>> move the inactive page to the DRAM node, it sounds not make too much sense
>> since memory reclaim would prefer to move downwards (DRAM -> PMEM -> Disk).
> There are certainly many details to sort out. One thing is how to handle
> cpuless nodes (e.g. PMEM). Those shouldn't get any direct allocations
> without an explicit binding, right? My first naive idea would be to only

Wait a minute. I thought we were arguing about the default allocation
node mask yesterday. And the conclusion was that PMEM nodes should not be
excluded from the node mask. PMEM nodes are cpuless nodes. I think I
should replace "PMEM node" with "cpuless node" throughout the cover letter
and commit logs to make that explicit.

Quoted from Dan "For ACPI platforms the HMAT is effectively going to 
enforce "cpu-less" nodes for any memory range that has differentiated 
performance from the conventional memory pool, or differentiated 
performance for a specific initiator."

I apologize for not making it clear in the first place that PMEM nodes
are cpuless nodes. Of course, a cpuless node may not be a PMEM node.

To your question, yes, I do agree. Actually, this is what I meant by
"DRAM only by default", or I should rephrase it as "exclude cpuless
nodes"; I thought they meant the same thing.

> migrate-on-reclaim only from the preferred node. We might need

If we exclude cpuless nodes, yes. The preferred node would only ever be a
DRAM node. Actually, the patchset does follow "migrate-on-reclaim only
from the preferred node".

Thanks,
Yang

> additional heuristics but I wouldn't special case PMEM from other
> cpuless NUMA nodes.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
  2019-03-28 19:40                         ` Yang Shi
@ 2019-03-28 20:40                           ` Michal Hocko
  0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2019-03-28 20:40 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dan Williams, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, Du, Fan,
	Huang, Ying, Linux MM, Linux Kernel Mailing List

On Thu 28-03-19 12:40:14, Yang Shi wrote:
> 
> 
> On 3/28/19 12:12 PM, Michal Hocko wrote:
> > On Thu 28-03-19 11:58:57, Yang Shi wrote:
> > > 
> > > On 3/27/19 11:58 PM, Michal Hocko wrote:
> > > > On Wed 27-03-19 19:09:10, Yang Shi wrote:
> > > > > One question, when doing demote and promote we need define a path, for
> > > > > example, DRAM <-> PMEM (assume two tier memory). When determining what nodes
> > > > > are "DRAM" nodes, does it make sense to assume the nodes with both cpu and
> > > > > memory are DRAM nodes since PMEM nodes are typically cpuless nodes?
> > > > Do we really have to special case this for PMEM? Why cannot we simply go
> > > > in the zonelist order? In other words why cannot we use the same logic
> > > > for a larger NUMA machine and instead of swapping simply fallback to a
> > > > less contended NUMA node? It can be a regular DRAM, PMEM or whatever
> > > > other type of memory node.
> > > Thanks for the suggestion. It makes sense. However, if we don't specialize a
> > > pmem node, its fallback node may be a DRAM node, then the memory reclaim may
> > > move the inactive page to the DRAM node, it sounds not make too much sense
> > > since memory reclaim would prefer to move downwards (DRAM -> PMEM -> Disk).
> > There are certainly many details to sort out. One thing is how to handle
> > cpuless nodes (e.g. PMEM). Those shouldn't get any direct allocations
> > without an explicit binding, right? My first naive idea would be to only
> 
> Wait a minute. I thought we were arguing about the default allocation node
> mask yesterday. And, the conclusion is PMEM node should not be excluded from
> the node mask. PMEM nodes are cpuless nodes. I think I should replace all
> "PMEM node" to "cpuless node" in the cover letter and commit logs to make it
> explicitly.

No, this is not about the default allocation mask at all. Your
allocations start from a local/mempolicy node. A CPUless node thus cannot
be a primary node, so it will only ever appear in a fallback zonelist
unless there is an explicit binding.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-27 13:08           ` Keith Busch
  2019-03-27 17:00             ` Zi Yan
@ 2019-03-28 21:59             ` Yang Shi
  2019-03-28 22:45               ` Keith Busch
  1 sibling, 1 reply; 66+ messages in thread
From: Yang Shi @ 2019-03-28 21:59 UTC (permalink / raw)
  To: Keith Busch
  Cc: mhocko, mgorman, riel, hannes, akpm, Hansen, Dave, Busch, Keith,
	Williams, Dan J, Wu, Fengguang, Du, Fan, Huang, Ying, linux-mm,
	linux-kernel



On 3/27/19 6:08 AM, Keith Busch wrote:
> On Tue, Mar 26, 2019 at 08:41:15PM -0700, Yang Shi wrote:
>> On 3/26/19 5:35 PM, Keith Busch wrote:
>>> migration nodes have higher free capacity than source nodes. And since
>>> your attempting THP's without ever splitting them, that also requires
>>> lower fragmentation for a successful migration.
>> Yes, it is possible. However, migrate_pages() already has logic to
>> handle such case. If the target node has not enough space for migrating
>> THP in a whole, it would split THP then retry with base pages.
> Oh, you're right, my mistake on splitting. So you have a good best effort
> migrate, but I still think it can fail for legitimate reasons that should
> have a swap fallback.

Yes, it could still fail. I can't tell which way is better for now. Off
the top of my head, I just thought that scanning another round and then
migrating should still be faster than swapping.

Thanks,
Yang



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node
  2019-03-28 21:59             ` Yang Shi
@ 2019-03-28 22:45               ` Keith Busch
  0 siblings, 0 replies; 66+ messages in thread
From: Keith Busch @ 2019-03-28 22:45 UTC (permalink / raw)
  To: Yang Shi
  Cc: mhocko, mgorman, riel, hannes, akpm, Hansen, Dave, Busch, Keith,
	Williams, Dan J, Wu, Fengguang, Du, Fan, Huang, Ying, linux-mm,
	linux-kernel

On Thu, Mar 28, 2019 at 02:59:30PM -0700, Yang Shi wrote:
> Yes, it still could fail. I can't tell which way is better for now. I 
> just thought scanning another round then migrating should be still 
> faster than swapping off the top of my head.

I think it depends on the relative capacities between your primary and
migration tiers and how it's used. Applications may allocate and pin
directly out of pmem if they wish, so it's not a dedicated fallback
memory space like swap.
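
For example, an application that explicitly wants its data on pmem can
bind a range to the pmem node with mbind(); a small sketch, where the
node number is an assumption and <numaif.h> comes from libnuma:

/* Sketch: place one anonymous mapping on an assumed PMEM node (node 2). */
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 1UL << 30;                 /* 1 GiB region */
        unsigned long pmem_node = 1UL << 2;     /* nodemask with node 2 set */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        if (mbind(p, len, MPOL_BIND, &pmem_node, 8 * sizeof(pmem_node), 0))
                perror("mbind");
        /* Touch the memory so pages are actually allocated on that node. */
        return 0;
}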

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/10] mm: numa: promote pages to DRAM when it is accessed twice
  2019-03-23  4:44 ` [PATCH 04/10] mm: numa: promote pages to DRAM when it is accessed twice Yang Shi
@ 2019-03-29  0:31   ` kbuild test robot
  0 siblings, 0 replies; 66+ messages in thread
From: kbuild test robot @ 2019-03-29  0:31 UTC (permalink / raw)
  To: Yang Shi
  Cc: kbuild-all, mhocko, mgorman, riel, hannes, akpm, dave.hansen,
	keith.busch, dan.j.williams, fengguang.wu, fan.du, ying.huang,
	yang.shi, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2088 bytes --]

Hi Yang,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.1-rc2 next-20190328]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Yang-Shi/Another-Approach-to-Use-PMEM-as-NUMA-Node/20190326-034920
config: i386-randconfig-x076-201912 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from include/linux/memcontrol.h:29:0,
                    from include/linux/swap.h:9,
                    from include/linux/suspend.h:5,
                    from arch/x86/kernel/asm-offsets.c:13:
>> include/linux/mm.h:862:2: error: #error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
    #error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
     ^~~~~
   make[2]: *** [arch/x86/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [sub-make] Error 2

vim +862 include/linux/mm.h

348f8b6c4 Dave Hansen       2005-06-23  860  
9223b4190 Christoph Lameter 2008-04-28  861  #if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
9223b4190 Christoph Lameter 2008-04-28 @862  #error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
348f8b6c4 Dave Hansen       2005-06-23  863  #endif
348f8b6c4 Dave Hansen       2005-06-23  864  

:::::: The code at line 862 was first introduced by commit
:::::: 9223b4190fa1297a59f292f3419fc0285321d0ea pageflags: get rid of FLAGS_RESERVED

:::::: TO: Christoph Lameter <clameter@sgi.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 25789 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2019-03-29  0:32 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-23  4:44 [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Yang Shi
2019-03-23  4:44 ` [PATCH 01/10] mm: control memory placement by nodemask for two tier main memory Yang Shi
2019-03-23 17:21   ` Dan Williams
2019-03-23 17:21     ` Dan Williams
2019-03-25 19:28     ` Yang Shi
2019-03-25 23:18       ` Dan Williams
2019-03-25 23:18         ` Dan Williams
2019-03-25 23:36         ` Yang Shi
2019-03-25 23:42           ` Dan Williams
2019-03-25 23:42             ` Dan Williams
2019-03-23  4:44 ` [PATCH 02/10] mm: mempolicy: introduce MPOL_HYBRID policy Yang Shi
2019-03-23  4:44 ` [PATCH 03/10] mm: mempolicy: promote page to DRAM for MPOL_HYBRID Yang Shi
2019-03-23  4:44 ` [PATCH 04/10] mm: numa: promote pages to DRAM when it is accessed twice Yang Shi
2019-03-29  0:31   ` kbuild test robot
2019-03-23  4:44 ` [PATCH 05/10] mm: page_alloc: make find_next_best_node could skip DRAM node Yang Shi
2019-03-23  4:44 ` [PATCH 06/10] mm: vmscan: demote anon DRAM pages to PMEM node Yang Shi
2019-03-23  6:03   ` Zi Yan
2019-03-25 21:49     ` Yang Shi
2019-03-24 22:20   ` Keith Busch
2019-03-25 19:49     ` Yang Shi
2019-03-27  0:35       ` Keith Busch
2019-03-27  3:41         ` Yang Shi
2019-03-27 13:08           ` Keith Busch
2019-03-27 17:00             ` Zi Yan
2019-03-27 17:05               ` Dave Hansen
2019-03-27 17:48                 ` Zi Yan
2019-03-27 18:00                   ` Dave Hansen
2019-03-27 20:37                     ` Zi Yan
2019-03-27 20:42                       ` Dave Hansen
2019-03-28 21:59             ` Yang Shi
2019-03-28 22:45               ` Keith Busch
2019-03-23  4:44 ` [PATCH 07/10] mm: vmscan: add page demotion counter Yang Shi
2019-03-23  4:44 ` [PATCH 08/10] mm: numa: add page promotion counter Yang Shi
2019-03-23  4:44 ` [PATCH 09/10] doc: add description for MPOL_HYBRID mode Yang Shi
2019-03-23  4:44 ` [PATCH 10/10] doc: elaborate the PMEM allocation rule Yang Shi
2019-03-25 16:15 ` [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node Brice Goglin
2019-03-25 16:56   ` Dan Williams
2019-03-25 16:56     ` Dan Williams
2019-03-25 17:45     ` Brice Goglin
2019-03-25 19:29       ` Dan Williams
2019-03-25 19:29         ` Dan Williams
2019-03-25 23:09         ` Brice Goglin
2019-03-25 23:37           ` Dan Williams
2019-03-25 23:37             ` Dan Williams
2019-03-26 12:19             ` Jonathan Cameron
2019-03-25 20:04   ` Yang Shi
2019-03-26 13:58 ` Michal Hocko
2019-03-26 18:33   ` Yang Shi
2019-03-26 18:37     ` Michal Hocko
2019-03-27  2:58       ` Yang Shi
2019-03-27  9:01         ` Michal Hocko
2019-03-27 17:34           ` Dan Williams
2019-03-27 17:34             ` Dan Williams
2019-03-27 18:59             ` Yang Shi
2019-03-27 20:09               ` Michal Hocko
2019-03-28  2:09                 ` Yang Shi
2019-03-28  6:58                   ` Michal Hocko
2019-03-28 18:58                     ` Yang Shi
2019-03-28 19:12                       ` Michal Hocko
2019-03-28 19:40                         ` Yang Shi
2019-03-28 20:40                           ` Michal Hocko
2019-03-28  8:21                   ` Dan Williams
2019-03-28  8:21                     ` Dan Williams
2019-03-27 20:14               ` Dave Hansen
2019-03-27 20:35             ` Matthew Wilcox
2019-03-27 20:40               ` Dave Hansen
